Files in Action

Let’s work through a simple example that demonstrates file-processing basics. The following code begins by opening a new text file for output, writing two lines (strings terminated with a newline marker, \n), and closing the file. Later, the example opens the same file again in input mode and reads the lines back one at a time with readline. Notice that the third readline call returns an empty string; this is how Python file methods tell you that you’ve reached the end of the file (empty lines in the file come back as strings containing just a newline character, not as empty strings). Here’s the complete interaction:

>>> myfile = open('myfile.txt', 'w')        # Open for text output: create/empty
>>> myfile.write('hello text file\n')       # Write a line of text: string
16
>>> myfile.write('goodbye text file\n')
18
>>> myfile.close()                          # Flush output buffers to disk

>>> myfile = open('myfile.txt')             # Open for text input: 'r' is default
>>> myfile.readline()                       # Read the lines back
'hello text file\n'
>>> myfile.readline()
'goodbye text file\n'
>>> myfile.readline()                       # Empty string: end of file
''

广告:个人专属 VPN,独立 IP,无限流量,多机房切换,还可以屏蔽广告和恶意软件,每月最低仅 5 美元

Notice that file write calls return the number of characters written in Python 3.0; in 2.6 they don’t, so you won’t see these numbers echoed interactively. This example writes each line of text, including its end-of-line terminator, \n, as a string; write methods don’t add the end-of-line character for us, so we must include it to properly terminate our lines (otherwise the next write will simply extend the current line in the file).

If you want to display the file’s content with end-of-line characters interpreted, read the entire file into a string all at once with the file object’s read method and print it:

>>> open('myfile.txt').read()               # Read all at once into string
'hello text file\ngoodbye text file\n'

>>> print(open('myfile.txt').read())        # User-friendly display
hello text file
goodbye text file

And if you want to scan a text file line by line, file iterators are often your best option:

>>> for line in open('myfile'):             # Use file iterators, not reads
...     print(line, end='')
...
hello text file
goodbye text file

When coded this way, the temporary file object created by open will automatically read and return one line on each loop iteration. This form is usually easiest to code, good on memory use, and may be faster than some other options (depending on many variables, of course). Since we haven’t reached statements or iterators yet, though, you’ll have to wait until Chapter 14 for a more complete explanation of this code.

Text and binary files in Python 3.0

Strictly speaking, the example in the prior section uses text files. In both Python 3.0 and 2.6, file type is determined by the second argument to open, the mode string—an included “b” means binary. Python has always supported both text and binary files, but in Python 3.0 there is a sharper distinction between the two:

 

 
  • Text files represent content as normal str strings, perform Unicode encoding and decoding automatically, and perform end-of-line translation by default.
  • Binary files represent content as a special bytes string type and allow programs to access file content unaltered.

In contrast, Python 2.6 text files handle both 8-bit text and binary data, and a special string type and file interface (unicode strings and codecs.open) handles Unicode text. The differences in Python 3.0 stem from the fact that simple and Unicode text have been merged in the normal string type—which makes sense, given that all text is Unicode, including ASCII and other 8-bit encodings.

Because most programmers deal only with ASCII text, they can get by with the basic text file interface used in the prior example, and normal strings. All strings are technically Unicode in 3.0, but ASCII users will not generally notice. In fact, files and strings work the same in 3.0 and 2.6 if your script’s scope is limited to such simple forms of text.

If you need to handle internationalized applications or byte-oriented data, though, the distinction in 3.0 impacts your code (usually for the better). In general, you must use bytes strings for binary files, and normal str strings for text files. Moreover, because text files implement Unicode encodings, you cannot open a binary data file in text mode—decoding its content to Unicode text will likely fail.

Let’s look at an example. When you read a binary data file you get back a bytes object—a sequence of small integers that represent absolute byte values (which may or may not correspond to characters), which looks and feels almost exactly like a normal string:

>>> data = open('data.bin', 'rb').read()    # Open binary file: rb=read binary
>>> data                                    # bytes string holds binary data
b'\x00\x00\x00\x07spam\x00\x08'
>>> data[4:8]                               # Act like strings
b'spam'
>>> data[4:8][0]                            # But really are small 8-bit integers
115
>>> bin(data[4:8][0])                       # Python 3.0 bin() function
'0b1110011'

In addition, binary files do not perform any end-of-line translation on data; text files by default map all forms to and from \n when written and read and implement Unicode encodings on transfers. Since Unicode and binary data is of marginal interest to many Python programmers, we’ll postpone the full story until Chapter 36. For now, let’s move on to some more substantial file examples.

Storing and parsing Python objects in files

Our next example writes a variety of Python objects into a text file on multiple lines. Notice that it must convert objects to strings using conversion tools. Again, file data is always strings in our scripts, and write methods do not do any automatic to-string formatting for us (for space, I’m omitting byte-count return values from write methods from here on):

>>> X, Y, Z = 43, 44, 45                    # Native Python objects
>>> S = 'Spam'                              # Must be strings to store in file
>>> D = {'a': 1, 'b': 2}
>>> L = [1, 2, 3]
>>>
>>> F = open('datafile.txt', 'w')           # Create output file
>>> F.write(S + '\n')                       # Terminate lines with \n
>>> F.write('%s,%s,%s\n' % (X, Y, Z))       # Convert numbers to strings
>>> F.write(str(L) + '$' + str(D) + '\n')   # Convert and separate with $
>>> F.close()

Once we have created our file, we can inspect its contents by opening it and reading it into a string (a single operation). Notice that the interactive echo gives the exact byte contents, while the print operation interprets embedded end-of-line characters to render a more user-friendly display:

>>> chars = open('datafile.txt').read()        # Raw string display
>>> chars
"Spam\n43,44,45\n[1, 2, 3]${'a': 1, 'b': 2}\n"
>>> print(chars)                               # User-friendly display
Spam
43,44,45
[1, 2, 3]${'a': 1, 'b': 2}

We now have to use other conversion tools to translate from the strings in the text file to real Python objects. As Python never converts strings to numbers (or other types of objects) automatically, this is required if we need to gain access to normal object tools like indexing, addition, and so on:

>>> F = open('datafile.txt')                  # Open again
>>> line = F.readline()                       # Read one line
>>> line
'Spam\n'
>>> line.rstrip()                             # Remove end-of-line
'Spam'

For this first line, we used the string rstrip method to get rid of the trailing end-of-line character; a line[:−1] slice would work, too, but only if we can be sure all lines end in the \n character (the last line in a file sometimes does not).

So far, we’ve read the line containing the string. Now let’s grab the next line, which contains numbers, and parse out (that is, extract) the objects on that line:

>>> line = F.readline()                       # Next line from file
>>> line                                      # It's a string here
'43,44,45\n'
>>> parts = line.split(',')                   # Split (parse) on commas
>>> parts
['43', '44', '45\n']

We used the string split method here to chop up the line on its comma delimiters; the result is a list of substrings containing the individual numbers. We still must convert from strings to integers, though, if we wish to perform math on these:

>>> int(parts[1])                             # Convert from string to int
44
>>> numbers = [int(P) for P in parts]         # Convert all in list at once
>>> numbers
[43, 44, 45]

As we have learned, int translates a string of digits into an integer object, and the list comprehension expression introduced in Chapter 4 can apply the call to each item in our list all at once (you’ll find more on list comprehensions later in this book). Notice that we didn’t have to run rstrip to delete the \n at the end of the last part; int and some other converters quietly ignore whitespace around digits.

Finally, to convert the stored list and dictionary in the third line of the file, we can run them through eval, a built-in function that treats a string as a piece of executable program code (technically, a string containing a Python expression):

>>> line = F.readline()
>>> line
"[1, 2, 3]${'a': 1, 'b': 2}\n"
>>> parts = line.split('$')                   # Split (parse) on $
>>> parts
['[1, 2, 3]', "{'a': 1, 'b': 2}\n"]
>>> eval(parts[0])                            # Convert to any object type
[1, 2, 3]
>>> objects = [eval(P) for P in parts]        # Do same for all in list
>>> objects
[[1, 2, 3], {'a': 1, 'b': 2}]

Because the end result of all this parsing and converting is a list of normal Python objects instead of strings, we can now apply list and dictionary operations to them in our script.

Storing native Python objects with pickle

Using eval to convert from strings to objects, as demonstrated in the preceding code, is a powerful tool. In fact, sometimes it’s too powerful. eval will happily run any Python expression—even one that might delete all the files on your computer, given the necessary permissions! If you really want to store native Python objects, but you can’t trust the source of the data in the file, Python’s standard library pickle module is ideal.

The pickle module is an advanced tool that allows us to store almost any Python object in a file directly, with no to- or from-string conversion requirement on our part. It’s like a super-general data formatting and parsing utility. To store a dictionary in a file, for instance, we pickle it directly:

>>> D = {'a': 1, 'b': 2}
>>> F = open('datafile.pkl', 'wb')
>>> import pickle
>>> pickle.dump(D, F)                         # Pickle any object to file
>>> F.close()

Then, to get the dictionary back later, we simply use pickle again to re-create it:

>>> F = open('datafile.pkl', 'rb')
>>> E = pickle.load(F)                        # Load any object from file
>>> E
{'a': 1, 'b': 2}

We get back an equivalent dictionary object, with no manual splitting or converting required. The pickle module performs what is known as object serialization—converting objects to and from strings of bytes—but requires very little work on our part. In fact, pickle internally translates our dictionary to a string form, though it’s not much to look at (and may vary if we pickle in other data protocol modes):

>>> open('datafile.pkl', 'rb').read()         # Format is prone to change!
b'\x80\x03}q\x00(X\x01\x00\x00\x00aq\x01K\x01X\x01\x00\x00\x00bq\x02K\x02u.'

Because pickle can reconstruct the object from this format, we don’t have to deal with that ourselves. For more on the pickle module, see the Python standard library manual, or import pickle and pass it to help interactively. While you’re exploring, also take a look at the shelve module. shelve is a tool that uses pickle to store Python objects in an access-by-key filesystem, which is beyond our scope here (though you will get to see an example of shelve in action in Chapter 27, and other pickle examples in Chapters 30 and 36).


Note

Note that I opened the file used to store the pickled object in binary mode; binary mode is always required in Python 3.0, because the pickler creates and uses a bytes string object, and these objects imply binary-mode files (text-mode files imply str strings in 3.0). In earlier Pythons it’s OK to use text-mode files for protocol 0 (the default, which creates ASCII text), as long as text mode is used consistently; higher protocols require binary-mode files. Python 3.0’s default protocol is 3 (binary), but it creates bytes even for protocol 0. See Chapter 36, Python’s library manual, or reference books for more details on this.

Python 2.6 also has a cPickle module, which is an optimized version of pickle that can be imported directly for speed. Python 3.0 renames this module _pickle and uses it automatically in pickle—scripts simply import pickle and let Python optimize itself.


Storing and parsing packed binary data in files

One other file-related note before we move on: some advanced applications also need to deal with packed binary data, created perhaps by a C language program. Python’s standard library includes a tool to help in this domain—the struct module knows how to both compose and parse packed binary data. In a sense, this is another data-conversion tool that interprets strings in files as binary data.

To create a packed binary data file, for example, open it in 'wb' (write binary) mode, and pass struct a format string and some Python objects. The format string used here means pack as a 4-byte integer, a 4-character string, and a 2-byte integer, all in big-endian form (other format codes handle padding bytes, floating-point numbers, and more):

>>> F = open('data.bin', 'wb')                     # Open binary output file
>>> import struct
>>> data = struct.pack('>i4sh', 7, 'spam', 8)      # Make packed binary data
>>> data
b'\x00\x00\x00\x07spam\x00\x08'
>>> F.write(data)                                  # Write byte string
>>> F.close()

Python creates a binary bytes data string, which we write out to the file normally—this one consists mostly of nonprintable characters printed in hexadecimal escapes, and is the same binary file we met earlier. To parse the values out to normal Python objects, we simply read the string back and unpack it using the same format string. Python extracts the values into normal Python objects—integers and a string:

>>> F = open('data.bin', 'rb')
>>> data = F.read()                                # Get packed binary data
>>> data
b'\x00\x00\x00\x07spam\x00\x08'
>>> values = struct.unpack('>i4sh', data)          # Convert to Python objects
>>> values
(7, 'spam', 8)

Binary data files are advanced and somewhat low-level tools that we won’t cover in more detail here; for more help, see Chapter 36, consult the Python library manual, or import struct and pass it to the help function interactively. Also note that the binary file-processing modes 'wb' and 'rb' can be used to process a simpler binary file such as an image or audio file as a whole without having to unpack its contents.

File context managers

You’ll also want to watch for Chapter 33’s discussion of the file’s context manager support, new in Python 3.0 and 2.6. Though more a feature of exception processing than files themselves, it allows us to wrap file-processing code in a logic layer that ensures that the file will be closed automatically on exit, instead of relying on the auto-close on garbage collection:

with open(r'C:\misc\data.txt') as myfile:        # See Chapter 33 for details
    for line in myfile:
        ...use line here...

The try/finally statement we’ll look at in Chapter 33 can provide similar functionality, but at some cost in extra code—three extra lines, to be precise (though we can often avoid both options and let Python close files for us automatically):

myfile = open(r'C:\misc\data.txt')
try:
    for line in myfile:
        ...use line here...
finally:
    myfile.close()

Since both these options require more information than we have yet obtained, we’ll postpone details until later in this book.