from_roman should fail with too many repeated numerals ... ok
from_roman should give known result with known input ... ok
to_roman should give known result with known input ... ok
from_roman(to_roman(n))==n for all n ... ok
to_roman should fail with negative input ... ok
to_roman should fail with non-integer input ... ok
to_roman should fail with large input ... ok
to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 12 tests in 0.031s
OK ①
1. Not that you asked, but it’s fast, too! Like, almost 10× as fast. Of course, it’s not entirely a fair comparison,
because this version takes longer to import (when it builds the lookup tables). But since the import is only
done once, the startup cost is amortized over all the calls to the to_roman() and from_roman() functions.
Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings
adds up in a hurry!
The moral of the story?
• Simplicity is a virtue.
• Especially when regular expressions are involved.
• Unit tests can give you the confidence to do large-scale refactoring.
⁂
10.4. SUMMARY
Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and
increase flexibility in any long-term project. It is also important to understand that unit testing is not a
panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to
date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a
replacement for other forms of testing, including functional testing, integration testing, and user acceptance
testing. But it is feasible, and it does work, and once you’ve seen it work, you’ll wonder how you ever got
along without it.
These few chapters have covered a lot of ground, and much of it wasn’t even Python-specific. There are unit
testing frameworks for many languages, all of which require you to understand the same basic concepts:
• Designing test cases that are specific, automated, and independent
• Writing test cases before the code they are testing
• Writing tests that test good input and check for proper results
• Writing tests that test bad input and check for proper failure responses
• Writing and updating test cases to reflect new requirements
• Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility
you’re lacking
CHAPTER 11. FILES
❝ A nine mile walk is no joke, especially in the rain. ❞
— Harry Kemelman, The Nine Mile Walk
11.1. DIVING IN
My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added
almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the
concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files.
11.2. READING FROM TEXT FILES
Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:
a_file = open('examples/chinese.txt', encoding='utf-8')
Python has a built-in open() function, which takes a filename as an argument. Here the filename is
'examples/chinese.txt'. There are five interesting things about this filename:
1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-
opening function could have taken two arguments — a directory path and a filename — but the open()
function only takes one. In Python, whenever you need a “filename,” you can include some or all of a
directory path as well.
2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses
backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python,
forward slashes always Just Work, even on Windows.
3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to
what, you might ask? Patience, grasshopper.
4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and
directories. Python 3 fully supports non-ASCII pathnames.
5. It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a
figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.
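The first and third points can be checked directly. Here is a minimal sketch using the standard library's os.path module; the path is the same hypothetical one from the example above.

```python
import os.path

# Point 1: a "filename" in Python can carry a directory path.
# os.path.join builds such a combined path portably.
path = os.path.join('examples', 'chinese.txt')
print(path)  # 'examples/chinese.txt' on Mac OS X and Linux

# Point 3: this path does not begin with a slash or drive letter,
# so it is relative, not absolute.
print(os.path.isabs(path))  # False
```
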
But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding.
Oh dear, that sounds dreadfully familiar.
11.2.1. CHARACTER ENCODING REARS ITS UGLY HEAD
Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text
file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes
the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters
(otherwise known as a string).
# This example was created on Windows. Other platforms may
# behave differently, for reasons outlined below.
>>> file = open('examples/chinese.txt')
>>> a_string = file.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>
>>>
What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.
But wait, it’s worse than that! The default encoding is platform-dependent, so this code might work on your computer (if your default encoding is UTF-8), but then it will fail when you distribute it to someone else (whose default encoding is different, like CP-1252).
☞ If you need to get the default character encoding, import the locale module and call
locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but
on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my
own house! Your results may be different (even on Windows) depending on which
version of your operating system you have installed and how your regional/language
settings are configured. This is why it’s so important to specify the encoding every
time you open a file.
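You can check the note above on your own machine; this is a minimal sketch, and the value printed depends entirely on your platform and regional settings.

```python
import locale

# The encoding Python falls back on when you omit the encoding
# parameter. The exact value is platform-dependent: 'cp1252',
# 'UTF-8', or something else entirely.
default_encoding = locale.getpreferredencoding()
print(default_encoding)
```
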
11.2.2. STREAM OBJECTS
So far, all we know is that Python has a built-in function called open(). The open() function returns a
stream object, which has methods and attributes for getting information about and manipulating a stream of
characters.
>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.name
①
'examples/chinese.txt'
>>> a_file.encoding
②
'utf-8'
>>> a_file.mode
③
'r'
1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is
not normalized to an absolute pathname.
2. Likewise, the encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify
the encoding when you opened the file (bad developer!) then the encoding attribute will reflect
locale.getpreferredencoding().
3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter
to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r',
which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves
several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in
which you deal with bytes instead of strings).
☞ The documentation for the open() function lists all the possible file modes.
11.2.3. READING DATA FROM A TEXT FILE
After you open a file for reading, you’ll probably want to read from it at some point.
>>> a_file = open('examples/chinese.txt', encoding='utf-8')
>>> a_file.read()
①
'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'
>>> a_file.read()
②
''
1. Once you open a file (with the correct encoding), reading from it is just a matter of calling the stream
object’s read() method. The result is a string.
2. Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider
reading past end-of-file to be an error; it simply returns an empty string.
What if you want to re-read a file?
# continued from the previous example
>>> a_file.read()
①
''
>>> a_file.seek(0)
②
0
>>> a_file.read(16)
③
'Dive Into Python'
>>> a_file.read(1)
④
' '
>>> a_file.read(1)
'是'
>>> a_file.tell()
⑤
20
Always specify an encoding parameter when you open a file.
1. Since you’re still at the end of the file, further calls to
the stream object’s read() method simply return an
empty string.
2. The seek() method moves to a specific byte position in a file.
3. The read() method can take an optional parameter, the number of characters to read.
4. If you like, you can even read one character at a time.
5. 16 + 1 + 1 = … 20?
Let’s try that again.
# continued from the previous example
>>> a_file.seek(17)
①
17
>>> a_file.read(1)
②
'是'
>>> a_file.tell()
③
20
1. Move to the 17th byte.
2. Read one character.
3. Now you’re on the 20th byte.
Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as
text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8.
The English characters in the file only require one byte each, so you might be misled into thinking that the
seek() and read() methods are counting the same thing. But that’s only true for some characters.
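You can verify the byte counts without opening a file at all; this quick sketch uses the str.encode() method on the same text.

```python
# Each ASCII character is one byte in UTF-8, but the Chinese character
# 是 takes three -- which is how reading 16 + 1 + 1 characters lands
# you on byte 20 (16 + 1 + 3).
print(len('Dive Into Python'))                  # 16 characters
print(len('Dive Into Python'.encode('utf-8')))  # 16 bytes
print(len('是'))                                # 1 character
print(len('是'.encode('utf-8')))                # 3 bytes
```
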
But wait, it gets worse!
>>> a_file.seek(18)
①
18
>>> a_file.read(1)
②
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
a_file.read(1)
File "C:\Python31\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte
1. Move to the 18th byte and try to read one character.
2. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th
byte (and goes for three bytes). Trying to read a character from the middle will fail with a
UnicodeDecodeError.
11.2.4. CLOSING FILES
Open files consume system resources, and depending on the file mode, other programs may not be able to
access them. It’s important to close files as soon as you’re finished with them.
# continued from the previous example
>>> a_file.close()
Well that was anticlimactic.
The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s
not terribly useful.
# continued from the previous example
>>> a_file.read()
①
Traceback (most recent call last):
File "<pyshell#24>", line 1, in <module>
a_file.read()
ValueError: I/O operation on closed file.
>>> a_file.seek(0)
②
Traceback (most recent call last):
File "<pyshell#25>", line 1, in <module>
a_file.seek(0)
ValueError: I/O operation on closed file.
>>> a_file.tell()
③
Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
a_file.tell()
ValueError: I/O operation on closed file.
>>> a_file.close()
④
>>> a_file.closed
⑤
True
1. You can’t read from a closed file; that raises a ValueError exception.
2. You can’t seek in a closed file either.
3. There’s no current position in a closed file, so the tell() method also fails.
4. Perhaps surprisingly, calling the close() method on a stream object whose file has been closed does not
raise an exception. It’s just a no-op.
5. Closed stream objects do have one useful attribute: the closed attribute will confirm that the file is closed.
11.2.5. CLOSING FILES AUTOMATICALLY
Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.
try..finally is good. with is better.
Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.6 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement.
with open('examples/chinese.txt', encoding='utf-8') as a_file:
    a_file.seek(17)
    a_character = a_file.read(1)
    print(a_character)
This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an
if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object
returned from the call to open(). All the regular stream object methods are available — seek(), read(),
whatever you need. When the with block ends, Python calls a_file.close() automatically.
Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you
“exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire
program comes to a screeching halt, that file will get closed. Guaranteed.
☞ In technical terms, the with statement creates a runtime context. In these examples,
the stream object acts as a context manager. Python creates the stream object
a_file and tells it that it is entering a runtime context. When the with code block
is completed, Python tells the stream object that it is exiting the runtime context,
and the stream object calls its own close() method. See Appendix B, “Classes That
Can Be Used in a with Block” for details.
There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime
contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a
stream object, then it does useful file-like things (like closing the file automatically). But that behavior is
defined in the stream object, not in the with statement. There are lots of other ways to use context
managers that have nothing to do with files. You can even create your own, as you’ll see later in this
chapter.
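As a taste of that, here is a minimal non-file context manager built with the standard library's contextlib.contextmanager decorator. The timer itself is a made-up example, not something from this chapter.

```python
import contextlib
import time

@contextlib.contextmanager
def timed(label):
    # Code before the yield runs on entering the with block; the
    # finally clause runs on exit, even if an exception was raised.
    start = time.perf_counter()
    try:
        yield
    finally:
        print('{} took {:.6f} seconds'.format(label, time.perf_counter() - start))

with timed('summing'):
    total = sum(range(1000))
```

Nothing here touches a file; the with statement only cares that the object entering the block knows how to enter and exit a runtime context.
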
11.2.6. READING DATA ONE LINE AT A TIME
A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re
on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated,
because text files can use several different characters to mark the end of a line. Every operating system has
its own convention. Some use a carriage return character, others use a line feed character, and some use
both characters at the end of every line.
Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to read this text file one line at a time,” Python will figure out which kind of line ending the text file uses, and it will all Just Work.
☞ If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details.
So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.
line_number = 0
with open('examples/favorite-people.txt', encoding='utf-8') as a_file:
①
    for a_line in a_file:
②
        line_number += 1
        print('{:>4} {}'.format(line_number, a_line.rstrip()))
③
1. Using the with pattern, you safely open the file and let Python close it for you.
2. To read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the
stream object is also an iterator which spits out a single line every time you ask for a value.
3. Using the format() string method, you can print out the line number and the line itself. The format specifier
{:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete
line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the
carriage return characters.
you@localhost:~/diveintopython3$ python3 examples/oneline.py
1 Dora
2 Ethan
3 Wesley
4 John
5 Anne
6 Mike
7 Chris
8 Sarah
9 Alex
10 Lizzie
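A variation worth knowing: the built-in enumerate() function can do the line counting for you, so the line_number variable disappears. Any iterable of strings works where the loop only needs lines; the list below stands in for the real file so the sketch is self-contained.

```python
# enumerate(iterable, start=1) yields (number, item) pairs,
# beginning the count at 1 instead of 0.
fake_lines = ['Dora\n', 'Ethan\n', 'Wesley\n']
for line_number, a_line in enumerate(fake_lines, start=1):
    print('{:>4} {}'.format(line_number, a_line.rstrip()))
```
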
Did you get this error?
you@localhost:~/diveintopython3$ python3 examples/oneline.py
Traceback (most recent call last):
File "examples/oneline.py", line 4, in <module>
print('{:>4} {}'.format(line_number, a_line.rstrip()))
ValueError: zero length field name in format
If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
Python 3.0 supported string formatting, but only with explicitly numbered format specifiers.
Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the
Python 3.0-compatible version for comparison:
print('{0:>4} {1}'.format(line_number, a_line.rstrip()))
⁂
11.3. WRITING TO TEXT FILES
You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file.
To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing:
• “Write” mode will overwrite the file. Pass mode='w' to the open() function.
• “Append” mode will add data to the end of the file. Pass mode='a' to the open() function.
Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing.
You should always close a file as soon as you’re done
writing to it, to release the file handle and ensure that
the data is actually written to disk. As with reading data from a file, you can call the stream object’s close()
method, or you can use the with statement and let Python close the file for you. I bet you can guess which
technique I recommend.
>>> with open('test.log', mode='w', encoding='utf-8') as a_file:
①
...     a_file.write('test succeeded')
②
>>> with open('test.log', encoding='utf-8') as a_file:
...     print(a_file.read())
test succeeded
>>> with open('test.log', mode='a', encoding='utf-8') as a_file:
③
...     a_file.write('and again')
>>> with open('test.log', encoding='utf-8') as a_file:
...     print(a_file.read())
test succeededand again
④
1. You start boldly by creating the new file test.log (or overwriting the existing file), and opening the file for
writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I
hope you didn’t care about the previous contents of that file (if any), because that data is gone now.
2. You can add data to the newly opened file with the write() method of the stream object returned by the
open() function. After the with block ends, Python automatically closes the file.
3. That was so fun, let’s do it again. But this time, with mode='a' to append to the file instead of overwriting
it. Appending will never harm the existing contents of the file.
4. Both the original line you wrote and the second line you appended are now in the file test.log. Also note
that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file
either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a
line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on
one line.
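To get separate lines, write the line-ending characters yourself. A quick sketch; the filename test2.log is a throwaway chosen for this example.

```python
# Writing explicit '\n' line feeds keeps the two writes on separate lines.
with open('test2.log', mode='w', encoding='utf-8') as a_file:
    a_file.write('test succeeded\n')
    a_file.write('and again\n')

with open('test2.log', encoding='utf-8') as a_file:
    print(a_file.read())
```
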
11.3.1. CHARACTER ENCODING AGAIN
Did you notice the encoding parameter that got passed in to the open() function while you were opening a
file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python
what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the
same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way
to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the
file for writing.
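To see why Python can't guess for you, consider that the same string becomes different byte sequences under different encodings. A small sketch:

```python
# The same two characters encode to six bytes in UTF-8 but only four
# in GB18030, a Chinese encoding. Without an explicit encoding,
# Python has no way to know which bytes you meant.
s = '你好'
print(s.encode('utf-8'))    # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(s.encode('gb18030'))  # b'\xc4\xe3\xba\xc3'
```
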
⁂
11.4. BINARY FILES
Not all files contain text. Some of them contain pictures of my dog.
>>> an_image = open('examples/beauregard.jpg', mode='rb')
①
>>> an_image.mode
②
'rb'
>>> an_image.name
③
'examples/beauregard.jpg'
>>> an_image.encoding
④
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'
1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that
the mode parameter contains a 'b' character.
2. The stream object you get from opening a file in binary mode has many of the same attributes, including
mode, which reflects the mode parameter you passed into the open() function.
3. Binary stream objects also have a name attribute, just like text stream objects.
4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right?
You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out
of a binary file is exactly what you put into it, no conversion necessary.
Did I mention you’re reading bytes? Oh yes you are.
# continued from the previous example
>>> an_image.tell()
0
>>> data = an_image.read(3)
①
>>> data
b'\xff\xd8\xff'
>>> type(data)
②
<class 'bytes'>
>>> an_image.tell()
③
3
>>> an_image.seek(0)
0
>>> data = an_image.read()
>>> len(data)
3150
1. Like text files, you can read binary files a little bit at a time. But there’s a crucial difference…
2. …you’re reading bytes, not strings. Since you opened the file in binary mode, the read() method takes the
number of bytes to read, not the number of characters.
3. That means that there’s never an unexpected mismatch between the number you passed into the read() method and the position index you get out of the tell() method. The read() method reads bytes, and the
seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree.
⁂
11.5. STREAM OBJECTS FROM NON-FILE SOURCES
Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object.
To read from a fake file, just call read().
In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.
That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in memory, even the output of another program. As long as your functions take a stream object and simply call the object’s read() method, you can handle any input source that acts like a file, without specific code to handle each kind of input.
>>> a_string = 'PapayaWhip is the new black.'
>>> import io
①
>>> a_file = io.StringIO(a_string)
②
>>> a_file.read()
③
'PapayaWhip is the new black.'
>>> a_file.read()
④
''
>>> a_file.seek(0)
⑤
0
>>> a_file.read(10)
⑥
'PapayaWhip'
>>> a_file.tell()
10
>>> a_file.seek(18)
18
>>> a_file.read()
'new black.'
1. The io module defines the StringIO class that you can use to treat a string in memory as a file.
2. To create a stream object out of a string, create an instance of the io.StringIO() class and pass it the
string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of
stream-like things with it.
3. Calling the read() method “reads” the entire “file,” which in the case of a StringIO object simply returns
the original string.
4. Just like a real file, calling the read() method again returns an empty string.
5. You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the
seek() method of the StringIO object.
6. You can also read the string in chunks, by passing a size parameter to the read() method.
☞ io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class,
which lets you treat a byte array as a binary file.
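A quick sketch of the bytes counterpart, reusing the first three bytes of the JPEG example from earlier:

```python
import io

# io.BytesIO treats an in-memory bytes object as a binary file.
a_file = io.BytesIO(b'\xff\xd8\xff')
print(a_file.read(2))  # b'\xff\xd8' -- read() counts bytes here
print(a_file.tell())   # 2 -- and so does tell()
```
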
11.5.1. HANDLING COMPRESSED FILES
The Python standard library contains modules that support reading and writing compressed files. There are a
number of different compression schemes; the two most popular on non-Windows systems are gzip and
bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.)
The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream
object it gives you supports the read() method (if you opened it for reading) or the write() method (if
you opened it for writing). That means you can use the methods you’ve already learned for regular files to
directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data.
As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-
compressed file when you’re done with it.
you@localhost:~$ python3
>>> import gzip
>>> with gzip.open('out.log.gz', mode='wb') as z_file:
①
...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
...
>>> exit()
you@localhost:~$ ls -l out.log.gz
②
-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz
you@localhost:~$ gunzip out.log.gz
③
you@localhost:~$ cat out.log
④
A nine mile walk is no joke, especially in the rain.
1. You should always open gzipped files in binary mode. (Note the 'b' character in the mode argument.)
2. I constructed this example on Linux. If you’re not familiar with the command line, this command is showing
the “long listing” of the gzip-compressed file you just created in the Python Shell. This listing shows that the
file exists (good), and that it is 79 bytes long. That’s actually larger than the string you started with! The gzip
file format includes a fixed-length header that contains some metadata about the file, so it’s inefficient for
extremely small files.
3. The gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file
named the same as the compressed file but without the .gz file extension.
4. The cat command displays the contents of a file. This file contains the string you originally wrote directly to
the compressed file out.log.gz from within the Python Shell.
Did you get this error?
>>> with gzip.open('out.log.gz', mode='wb') as z_file:
...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'GzipFile' object has no attribute '__exit__'
If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
Python 3.0 had a gzip module, but it did not support using a gzipped-file object as a
context manager. Python 3.1 added the ability to use gzipped-file objects in a with
statement.
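You don't need the command line to read the compressed file back; gzip.open works for reading too. A sketch, where out2.log.gz is a throwaway filename chosen for this example:

```python
import gzip

# Write and then re-read a gzip-compressed file entirely from Python.
# Binary mode again: the gzip stream deals in bytes, not strings.
with gzip.open('out2.log.gz', mode='wb') as z_file:
    z_file.write('A nine mile walk is no joke.'.encode('utf-8'))

with gzip.open('out2.log.gz', mode='rb') as z_file:
    data = z_file.read()

print(data.decode('utf-8'))
```
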
⁂
11.6. STANDARD INPUT, OUTPUT, AND ERROR
Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
sys.stdin, sys.stdout, sys.stderr.
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”.
>>> for i in range(3):
...     print('PapayaWhip')
①
PapayaWhip
PapayaWhip
PapayaWhip
>>> import sys
>>> for i in range(3):
...     sys.stdout.write('is the')
②
is theis theis the
>>> for i in range(3):
...     sys.stderr.write('new black')
③
new blacknew blacknew black
1. The print() function, in a loop. Nothing surprising here.
2. stdout is defined in the sys module, and it is a stream object. Calling its write() function will print out whatever string you give it. In fact, this is what the print() function really does; it adds a carriage return to the end of the string you’re printing, and calls sys.stdout.write.
3. In the simplest case, sys.stdout and sys.stderr send their output to the same place: the Python IDE (if
you’re in one), or the terminal (if you’re running Python from the command line). Like standard output,
standard error does not add carriage returns for you. If you want carriage returns, you’ll need to write
carriage return characters.
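If you'd rather not manage the line endings yourself, note that the print() function accepts an optional file parameter, so you can aim it at any stream and keep the automatic line ending. In this sketch a StringIO stands in for the real sys.stderr so the result is easy to inspect.

```python
import io

# print() writes to whatever stream its file parameter names,
# adding the trailing newline for you.
fake_stderr = io.StringIO()
print('new black', file=fake_stderr)
print(repr(fake_stderr.getvalue()))  # 'new black\n'
```

The same call with file=sys.stderr sends the text, newline included, to standard error.
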
sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read()
method will always raise an IOError.
>>> import sys
>>> sys.stdout.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: not readable
11.6.1. REDIRECTING STANDARD OUTPUT
sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not
constants; they’re variables. That means you can assign them a new value — any other stream object — to
redirect their output.
import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')
Check this out:
you@localhost:~/diveintopython3/examples$ python3 stdout.py
A
C
you@localhost:~/diveintopython3/examples$ cat out.log
B
Did you get this error?
you@localhost:~/diveintopython3/examples$ python3 stdout.py
File "stdout.py", line 15
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
^
SyntaxError: invalid syntax
If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.
Python 3.0 supported the with statement, but each statement can only use one context
manager. Python 3.1 allows you to chain multiple context managers in a single with
statement.
Let’s take the last part first.
print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')
That’s a complicated with statement. Let me rewrite it as something more recognizable.
with open('out.log', mode='w', encoding='utf-8') as a_file:
    with RedirectStdoutTo(a_file):
        print('B')
As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The
“outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for
writing and assigns the stream object to a variable named a_file. But that’s not the only thing odd here.
with RedirectStdoutTo(a_file):
Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and
ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In
this case, you’re only interested in the side effects of the RedirectStdoutTo context.
What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context
manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__().
class RedirectStdoutTo:
    def __init__(self, out_new):    ①
        self.out_new = out_new

    def __enter__(self):            ②
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):      ③
        sys.stdout = self.out_old
1. The __init__() method is called immediately after an instance is created. It takes one parameter, the
stream object that you want to use as standard output for the life of the context. This method just saves
the stream object in an instance variable so other methods can use it later.
2. The __enter__() method is a special class method; Python calls it when entering a context ( i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old,
then redirects standard output by assigning self.out_new to sys.stdout.
3. The __exit__() method is another special class method; Python calls it when exiting the context ( i.e. at the
end of the with statement). This method restores standard output to its original value by assigning the saved
self.out_old value to sys.stdout.
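As an aside beyond the original text, the same pattern can also be written with the standard library's contextlib module. This is a sketch of a generator-based equivalent; the function name redirect_stdout_to is made up for this illustration:

```python
import contextlib
import io
import sys

@contextlib.contextmanager
def redirect_stdout_to(out_new):
    """Generator-based sketch of the RedirectStdoutTo class."""
    out_old = sys.stdout        # code before the yield runs on entering the context
    sys.stdout = out_new
    try:
        yield
    finally:
        sys.stdout = out_old    # code after the yield runs on exiting, even on error

buffer = io.StringIO()
with redirect_stdout_to(buffer):
    print('B')                  # captured by the buffer, not printed to the screen
assert buffer.getvalue() == 'B\n'
```

The try/finally around the yield plays the role of the class's __exit__() method: restoration happens even if the with block raises an exception.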
Putting it all together:
print('A')                                                                             ①
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):  ②
    print('B')                                                                         ③
print('C')                                                                             ④
1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command
line).
2. This with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The
first context opens a file; the second context redirects sys.stdout to the stream object that was created in
the first context.
3. Because this print() function is executed with the context created by the with statement, it will not print
to the screen; it will write to the file out.log.
4. The with code block is over. Python has told each context manager to do whatever it is they do upon
exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context
changed sys.stdout back to its original value, then the first context closed the file named out.log. Since
standard output has been restored to its original value, calling the print() function will once again print to
the screen.
Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout.
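For instance, a hypothetical RedirectStderrTo class (not in the original text) is the same class with sys.stderr swapped in:

```python
import io
import sys

class RedirectStderrTo:
    """Same pattern as RedirectStdoutTo, applied to sys.stderr (a sketch)."""
    def __init__(self, err_new):
        self.err_new = err_new
    def __enter__(self):
        self.err_old = sys.stderr
        sys.stderr = self.err_new
    def __exit__(self, *args):
        sys.stderr = self.err_old

buffer = io.StringIO()
with RedirectStderrTo(buffer):
    print('oops', file=sys.stderr)   # lands in the buffer, not on the terminal
assert buffer.getvalue() == 'oops\n'
```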
⁂
11.7. FURTHER READING
• Reading and writing files in the Python.org tutorial
CHAPTER 12. XML
❝ In the archonship of Aristaechmus, Draco enacted his ordinances. ❞
12.1. DIVING IN
Nearly all the chapters in this book revolve around a piece of sample code. But XML isn’t about code;
it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum,
or other frequently-updated website. Most popular blogging software can produce a feed and update it
whenever new articles, discussion threads, or blog posts are published. You can follow a blog by
“subscribing” to its feed, and you can follow multiple blogs with a dedicated “feed aggregator” like Google
Reader. Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an Atom
syndication feed.
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into mark</title>
<subtitle>currently between addictions</subtitle>
<id>tag:diveintomark.org,2001-07-29:/</id>
<updated>2009-03-27T21:56:07Z</updated>
<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
<link rel='self' type='application/atom+xml' href='http://diveintomark.org/feed/'/>
<entry>
<author>
<name>Mark</name>
<uri>http://diveintomark.org/</uri>
</author>
<title>Dive into history, 2009 edition</title>
<link rel='alternate' type='text/html'
href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
<updated>2009-03-27T21:56:07Z</updated>
<published>2009-03-27T17:20:42Z</published>
<category scheme='http://diveintomark.org' term='diveintopython'/>
<category scheme='http://diveintomark.org' term='docbook'/>
<category scheme='http://diveintomark.org' term='html'/>
<summary type='html'>Putting an entire chapter on one page sounds
bloated, but consider this &mdash; my longest chapter so far
would be 75 printed pages, and it loads in under 5 seconds&hellip;
On dialup.</summary>
</entry>
<entry>
<author>
<name>Mark</name>
<uri>http://diveintomark.org/</uri>
</author>
<title>Accessibility is a harsh mistress</title>
<link rel='alternate' type='text/html'
href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>
<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
<updated>2009-03-22T01:05:37Z</updated>
<published>2009-03-21T20:09:28Z</published>
<category scheme='http://diveintomark.org' term='accessibility'/>
<summary type='html'>The accessibility orthodoxy does not permit people to
question the value of features that are rarely useful and rarely used.</summary>
</entry>
<entry>
<author>
<name>Mark</name>
</author>
<title>A gentle introduction to video encoding, part 1: container formats</title>
<link rel='alternate' type='text/html'
href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>
<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
<updated>2009-01-11T19:39:22Z</updated>
<published>2008-12-18T15:54:22Z</published>
<category scheme='http://diveintomark.org' term='asf'/>
<category scheme='http://diveintomark.org' term='avi'/>
<category scheme='http://diveintomark.org' term='encoding'/>
<category scheme='http://diveintomark.org' term='flv'/>
<category scheme='http://diveintomark.org' term='GIVE'/>
<category scheme='http://diveintomark.org' term='mp4'/>
<category scheme='http://diveintomark.org' term='ogg'/>
<category scheme='http://diveintomark.org' term='video'/>
<summary type='html'>These notes will eventually become part of a
tech talk on video encoding.</summary>
</entry>
</feed>
⁂
12.2. A 5-MINUTE CRASH COURSE IN XML
If you already know about XML, you can skip this section.
XML is a generalized way of describing hierarchical structured data. An XML document contains one or more
elements, which are delimited by start and end tags. This is a complete (albeit boring) XML document:
<foo>
①
</foo>
②
1. This is the start tag of the foo element.
2. This is the matching end tag of the foo element. Like balancing parentheses in writing or mathematics or
code, every start tag must be closed (matched) by a corresponding end tag.
Elements can be nested to any depth. An element bar inside an element foo is said to be a subelement or
child of foo.
<foo>
  <bar></bar>
</foo>
The first element in every XML document is called the root element. An XML document can only have one
root element. The following is not an XML document, because it has two root elements:
<foo></foo>
<bar></bar>
Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an
element and separated by whitespace. Attribute names can not be repeated within an element. Attribute values
must be quoted. You may use either single or double quotes.
<foo lang='en'>                          ①
  <bar id='papayawhip' lang='fr'></bar>  ②
</foo>
1. The foo element has one attribute, named lang. The value of its lang attribute is en.
2. The bar element has two attributes, named id and lang. The value of its lang attribute is fr. This doesn’t
conflict with the foo element in any way. Each element has its own set of attributes.
If an element has more than one attribute, the ordering of the attributes is not significant. An element’s
attributes form an unordered set of keys and values, like a Python dictionary. There is no limit to the
number of attributes you can define on each element.
Elements can have text content.
<foo lang='en'>
  <bar lang='fr'>PapayaWhip</bar>
</foo>
Elements that contain no text and no children are empty.
<foo></foo>
There is a shorthand for writing empty elements. By putting a / character in the start tag, you can skip the
end tag altogether. The XML document in the previous example could be written like this instead:
<foo/>
Just as Python functions can be declared in different modules, XML elements can be declared in different
namespaces. Namespaces usually look like URLs. You use an xmlns declaration to define a default namespace.
A namespace declaration looks similar to an attribute, but it has a different purpose.
<feed xmlns='http://www.w3.org/2005/Atom'>
①
<title>dive into mark</title>
②
</feed>
1. The feed element is in the http://www.w3.org/2005/Atom namespace.
2. The title element is also in the http://www.w3.org/2005/Atom namespace. The namespace declaration
affects the element where it’s declared, plus all child elements.
You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Then
each element in that namespace must be explicitly declared with the prefix.
<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>
①
<atom:title>dive into mark</atom:title>
②
</atom:feed>
1. The feed element is in the http://www.w3.org/2005/Atom namespace.
2. The title element is also in the http://www.w3.org/2005/Atom namespace.
As far as an XML parser is concerned, the previous two XML documents are identical. Namespace + element
name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (atom:) is
irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and
each element’s text content matches, therefore the XML documents are the same.
Finally, XML documents can contain character encoding information on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the
document can be parsed, Section F of the XML specification details how to resolve this Catch-22.)
<?xml version='1.0' encoding='utf-8'?>
And now you know just enough XML to be dangerous!
⁂
12.3. THE STRUCTURE OF AN ATOM FEED
Think of a weblog, or in fact any website with frequently updated content, like CNN.com. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment & Video News”), a
last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different
times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published
a correction or fixed a typo), and a unique URL.
The Atom syndication format is designed to capture all of this information in a standard format. My weblog
and CNN.com are wildly different in design, scope, and audience, but they both have the same basic
structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles.
At the top level is the root element, which every Atom feed shares: the feed element in the
http://www.w3.org/2005/Atom namespace.
<feed xmlns='http://www.w3.org/2005/Atom'  ①
      xml:lang='en'>                       ②
1. http://www.w3.org/2005/Atom is the Atom namespace.
2. Any element can contain an xml:lang attribute, which declares the language of the element and its children.
In this case, the xml:lang attribute is declared once on the root element, which means the entire feed is in
English.
An Atom feed contains several pieces of information about the feed itself. These are declared as children of
the root-level feed element.
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into mark</title>
①
<subtitle>currently between addictions</subtitle>
②
<id>tag:diveintomark.org,2001-07-29:/</id>
③
<updated>2009-03-27T21:56:07Z</updated>
④
<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
⑤
1. The title of this feed is dive into mark.
2. The subtitle of this feed is currently between addictions.
3. Every feed needs a globally unique identifier. See RFC 4151 for how to create one.
4. This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified
date of the most recent article.
5. Now things start to get interesting. This link element has no text content, but it has three attributes: rel,
type, and href. The rel value tells you what kind of link this is; rel='alternate' means that this is a link
to an alternate representation of this feed. The type='text/html' attribute means that this is a link to an
H T M L page. And the link target is given in the href attribute.
Now we know that this is a feed for a site named “dive into mark” which is available at
http://diveintomark.org/ and was last updated on March 27, 2009.
☞ Although the order of elements can be relevant in some XML documents, it is not
relevant in an Atom feed.
After the feed-level metadata is the list of the most recent articles. An article looks like this:
<entry>
<author>
①
<name>Mark</name>
<uri>http://diveintomark.org/</uri>
</author>
<title>Dive into history, 2009 edition</title>
②
<link rel='alternate' type='text/html'
③
href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
④
<updated>2009-03-27T21:56:07Z</updated>
⑤
<published>2009-03-27T17:20:42Z</published>
<category scheme='http://diveintomark.org' term='diveintopython'/>
⑥
<category scheme='http://diveintomark.org' term='docbook'/>
<category scheme='http://diveintomark.org' term='html'/>
<summary type='html'>Putting an entire chapter on one page sounds
⑦
bloated, but consider this &mdash; my longest chapter so far
would be 75 printed pages, and it loads in under 5 seconds&hellip;
On dialup.</summary>
</entry>
⑧
1. The author element tells who wrote this article: some guy named Mark, whom you can find loafing at
http://diveintomark.org/. (This is the same as the alternate link in the feed metadata, but it doesn’t have
to be. Many weblogs have multiple authors, each with their own personal website.)
2. The title element gives the title of the article, “Dive into history, 2009 edition”.
3. As with the feed-level alternate link, this link element gives the address of the HTML version of this article.
4. Entries, like feeds, need a unique identifier.
5. Entries have two dates: a first-published date (published) and a last-modified date (updated).
6. Entries can have an arbitrary number of categories. This article is filed under diveintopython, docbook, and
html.
7. The summary element gives a brief summary of the article. (There is also a content element, not shown
here, if you want to include the complete article text in your feed.) This summary element has the Atom-
specific type='html' attribute, which specifies that this summary is a snippet of HTML, not plain text. This is
important, since it has HTML-specific entities in it (— and …) which should be rendered as
“—” and “…” rather than displayed directly.
8. Finally, the end tag for the entry element, signaling the end of the metadata for this article.
⁂
12.4. PARSING XML
Python can parse XML documents in several ways. It has traditional DOM and SAX parsers, but I will focus on a different library called ElementTree.
>>> import xml.etree.ElementTree as etree
①
>>> tree = etree.parse('examples/feed.xml')
②
>>> root = tree.getroot()
③
>>> root
④
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>
1. The ElementTree library is part of the Python standard library, in xml.etree.ElementTree.
2. The primary entry point for the ElementTree library is the parse() function, which can take a filename or a
file-like object. This function parses the entire document at once. If memory is tight, there are ways to parse
an XML document incrementally instead.
3. The parse() function returns an object which represents the entire document. This is not the root element.
To get a reference to the root element, call the getroot() method.
4. As expected, the root element is the feed element in the http://www.w3.org/2005/Atom namespace. The
string representation of this object reinforces an important point: an XML element is a combination of its
namespace and its tag name (also called the local name). Every element in this document is in the Atom
namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.
☞ ElementTree represents XML elements as {namespace}localname. You’ll see and use
this format in multiple places in the ElementTree API.
12.4.1. ELEMENTS ARE LISTS
In the ElementTree API, an element acts like a list. The items of the list are the element’s children.
# continued from the previous example
>>> root.tag
①
'{http://www.w3.org/2005/Atom}feed'
>>> len(root)
②
8
>>> for child in root:
③
...     print(child)
④
...
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>
1. Continuing from the previous example, the root element is {http://www.w3.org/2005/Atom}feed.
2. The “length” of the root element is the number of child elements.
3. You can use the element itself as an iterator to loop through all of its child elements.
4. As you can see from the output, there are indeed 8 child elements: all of the feed-level metadata (title,
subtitle, id, updated, and link) followed by the three entry elements.
You may have guessed this already, but I want to point it out explicitly: the list of child elements only
includes direct children. Each of the entry elements contain their own children, but those are not included in
the list. They would be included in the list of each entry’s children, but they are not included in the list of
the feed’s children. There are ways to find elements no matter how deeply nested they are; we’ll look at
two such ways later in this chapter.
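As a quick aside beyond the original text, the iter() method is one such way to walk all descendants rather than just direct children. The miniature feed below is made up for this illustration, mirroring the structure of the sample document:

```python
import xml.etree.ElementTree as etree

# A made-up miniature feed: one feed-level title, two entries with their own titles.
xml = ('<feed xmlns="http://www.w3.org/2005/Atom">'
       '<title>outer</title>'
       '<entry><title>inner 1</title></entry>'
       '<entry><title>inner 2</title></entry>'
       '</feed>')
root = etree.fromstring(xml)

# The list of children contains only direct children: title + two entries.
assert len(root) == 3

# iter() walks the entire subtree, so it also finds the titles inside each entry.
titles = list(root.iter('{http://www.w3.org/2005/Atom}title'))
assert len(titles) == 3
```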
12.4.2. ATTRIBUTES ARE DICTIONARIES
XML isn’t just a collection of elements; each element can also have its own set of attributes. Once you have
a reference to a specific element, you can easily get its attributes as a Python dictionary.
# continuing from the previous example
>>> root.attrib
①
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
>>> root[4]
②
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib
③
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> root[3]
④
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib
⑤
{}
1. The attrib property is a dictionary of the element’s attributes. The original markup here was <feed
xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>. The xml: prefix refers to a built-in namespace
that every XML document can use without declaring it.
2. The fifth child — [4] in a 0-based list — is the link element.
3. The link element has three attributes: href, type, and rel.
4. The fourth child — [3] in a 0-based list — is the updated element.
5. The updated element has no attributes, so its .attrib is just an empty dictionary.
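Beyond indexing .attrib directly, each element also has a dictionary-like get() method. A small sketch, with the element constructed inline for illustration:

```python
import xml.etree.ElementTree as etree

link = etree.fromstring(
    "<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>")

assert link.attrib['rel'] == 'alternate'           # plain dictionary access
assert link.get('href') == 'http://diveintomark.org/'
assert link.get('hreflang') is None                # missing attribute: None
assert link.get('hreflang', 'en') == 'en'          # or a default you supply
```

Using get() avoids a KeyError when an optional attribute is absent.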
⁂
12.5. SEARCHING FOR NODES WITHIN AN XML DOCUMENT
So far, we’ve worked with this XML document “from the top down,” starting with the root element, getting
its child elements, and so on throughout the document. But many uses of XML require you to find specific
elements. ElementTree can do that, too.
>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry')
①
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
>>> root.findall('{http://www.w3.org/2005/Atom}feed')
②
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author')
③
[]
1. The findall() method finds child elements that match a specific query. (More on the query format in a
minute.)
2. Each element — including the root element, but also child elements — has a findall() method. It finds all
matching elements among the element’s children. But why aren’t there any results? Although it may not be
obvious, this particular query only searches the element’s children. Since the root feed element has no child
named feed, this query returns an empty list.
3. This result may also surprise you. There is an author element in this document; in fact, there are three (one in each entry). But those author elements are not direct children of the root element; they are
“grandchildren” (literally, a child element of a child element). If you want to look for author elements at any
nesting level, you can do that, but the query format is slightly different.
>>> tree.findall('{http://www.w3.org/2005/Atom}entry')
①
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> tree.findall('{http://www.w3.org/2005/Atom}author')
②
[]
1. For convenience, the tree object (returned from the etree.parse() function) has several methods that
mirror the methods on the root element. The results are the same as if you had called the
tree.getroot().findall() method.
2. Perhaps surprisingly, this query does not find the author elements in this document. Why not? Because this
is just a shortcut for tree.getroot().findall('{http://www.w3.org/2005/Atom}author'), which means
“find all the author elements that are children of the root element.” The author elements are not children
of the root element; they’re children of the entry elements. Thus the query doesn’t return any matches.
There is also a find() method which returns the first matching element. This is useful for situations where
you are only expecting one match, or if there are multiple matches, you only care about the first one.
>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')
①
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title')
②
>>> title_element.text
'Dive into history, 2009 edition'
>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')
③
>>> foo_element
>>> type(foo_element)
<class 'NoneType'>
1. You saw this in the previous example. It finds all the atom:entry elements.
2. The find() method takes an ElementTree query and returns the first matching element.
3. There are no elements in this entry named foo, so this returns None.
☞ There is a “gotcha” with the find() method that will eventually bite you. In a
boolean context, ElementTree element objects evaluate to False if they contain
no children ( i.e. if len(element) is 0). This means that code like
if element.find('...'): is not testing whether the find() method found a
matching element; it’s testing whether that matching element has any child
elements! To test whether the find() method returned an element, use
if element.find('...') is not None.
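To make the gotcha concrete, here is a small demonstration with a made-up two-element document:

```python
import xml.etree.ElementTree as etree

root = etree.fromstring('<parent><child/></parent>')

child = root.find('child')
assert child is not None   # find() DID locate the <child/> element...
assert len(child) == 0     # ...but it has no children, so in a boolean
                           # context it would evaluate to False.

# The safe idiom: compare the result against None explicitly.
if root.find('child') is not None:
    status = 'found'
else:
    status = 'missing'
assert status == 'found'
```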
There is a way to search for descendant elements, i.e. children, grandchildren, and any element at any nesting level.
>>> all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')
①
>>> all_links
[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
>>> all_links[0].attrib
②
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[1].attrib
③
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[2].attrib
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[3].attrib
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
'type': 'text/html',
'rel': 'alternate'}
1. This query — //{http://www.w3.org/2005/Atom}link — is very similar to the previous examples, except
for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct
children; I want any elements, regardless of nesting level.” So the result is a list of four link elements, not
just one.
2. The first result is a direct child of the root element. As you can see from its attributes, this is the feed-level
alternate link that points to the HTML version of the website that the feed describes.
3. The other three results are each entry-level alternate links. Each entry has a single link child element, and
because of the double slash at the beginning of the query, this query finds all of them.
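One caveat worth flagging as an editorial aside: the leading-double-slash form works when called on the tree object, but on an individual Element an absolute path is an error in the standard library; there, the relative spelling .// performs the same search of the whole subtree. A sketch with a made-up miniature feed:

```python
import xml.etree.ElementTree as etree

# Made-up miniature feed: one feed-level link, one entry-level link.
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link/><entry><link/></entry></feed>')

# './/' searches this element's entire subtree, at any nesting level.
links = root.findall('.//{http://www.w3.org/2005/Atom}link')
assert len(links) == 2   # the direct child and the nested one
```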
Overall, ElementTree’s findall() method is a very powerful feature, but the query language can be a bit
surprising. It is officially described as “limited support for XPath expressions.” XPath is a W3C standard for
querying XML documents. ElementTree’s query language is similar enough to XPath to do basic searching,
but dissimilar enough that it may annoy you if you already know XPath. Now let’s look at a third-party XML
library that extends the ElementTree API with full XPath support.
⁂
12.6. GOING FURTHER WITH LXML
lxml is an open source third-party library that builds on the popular libxml2 parser. It provides a 100%
compatible ElementTree API, then extends it with full XPath 1.0 support and a few other niceties. There are
installers available for Windows; Linux users should always try to use distribution-specific tools like yum or apt-get to install precompiled binaries from their repositories. Otherwise you’ll need to install lxml manually.
>>> from lxml import etree
①
>>> tree = etree.parse('examples/feed.xml')
②
>>> root = tree.getroot()
③
>>> root.findall('{http://www.w3.org/2005/Atom}entry')
④
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
1. Once imported, lxml provides the same API as the built-in ElementTree library.
2. parse() function: same as ElementTree.
3. getroot() method: also the same.
4. findall() method: exactly the same.
For large XML documents, lxml is significantly faster than the built-in ElementTree library. If you’re only
using the ElementTree API and want to use the fastest available implementation, you can try to import lxml
and fall back to the built-in ElementTree.
try:
from lxml import etree
except ImportError:
import xml.etree.ElementTree as etree
But lxml is more than just a faster ElementTree. Its findall() method includes support for more
complicated expressions.
>>> import lxml.etree
①
>>> tree = lxml.etree.parse('examples/feed.xml')
>>> tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')
②
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
>>> tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")
③
[<Element {http://www.w3.org/2005/Atom}link at eeb930>]
>>> NS = '{http://www.w3.org/2005/Atom}'
>>> tree.findall('//{NS}author[{NS}uri]'.format(NS=NS))
④
[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
<Element {http://www.w3.org/2005/Atom}author at eebba0>]
1. In this example, I’m going to import lxml.etree (instead of, say, from lxml import etree), to emphasize
that these features are specific to lxml.
2. This query finds all elements in the Atom namespace, anywhere in the document, that have an href
attribute. The // at the beginning of the query means “elements anywhere (not just as children of the root
element).” {http://www.w3.org/2005/Atom} means “only elements in the Atom namespace.” * means
“elements with any local name.” And [@href] means “has an href attribute.”
3. The query finds all Atom elements with an href whose value is http://diveintomark.org/.
4. After doing some quick string formatting (because otherwise these compound queries get ridiculously long), this query searches for Atom author elements that have an Atom uri element as a child. This only returns
two author elements, the ones in the first and second entry. The author in the last entry contains only a
name, not a uri.
Not enough for you? lxml also integrates support for arbitrary XPath 1.0 expressions. I’m not going to go
into depth about XPath syntax; that could be a whole book unto itself! But I will show you how it integrates
into lxml.
>>> import lxml.etree
>>> tree = lxml.etree.parse('examples/feed.xml')
>>> NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
①
>>> entries = tree.xpath("//atom:category[@term='accessibility']/..",
②
...                      namespaces=NSMAP)
>>> entries
③
[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]
>>> entry = entries[0]
>>> entry.xpath('./atom:title/text()', namespaces=NSMAP)
④
['Accessibility is a harsh mistress']
1. To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is
just a Python dictionary.
2. Here is an XPath query. The XPath expression searches for category elements (in the Atom namespace)
that contain a term attribute with the value accessibility. But that’s not actually the query result. Look at
the very end of the query string; did you notice the /.. bit? That means “and then return the parent
element of the category element you just found.” So this single XPath query will find all entries with a child
element of <category term='accessibility'>.
3. The xpath() function returns a list of ElementTree objects. In this document, there is only one entry with a
category whose term is accessibility.
4. XPath expressions don’t always return a list of elements. Technically, the DOM of a parsed XML document
doesn’t contain elements; it contains nodes. Depending on their type, nodes can be elements, attributes, or
even text content. The result of an XPath query is a list of nodes. This query returns a list of text nodes:
the text content (text()) of the title element (atom:title) that is a child of the current element (./).
⁂
12.7. GENERATING XML
Python’s support for XML is not limited to parsing existing documents. You can also create XML documents
from scratch.
>>> import xml.etree.ElementTree as etree
>>> new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',
①
...     attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})
②
>>> print(etree.tostring(new_feed))
③
<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>
1. To create a new element, instantiate the Element class. You pass the element name (namespace + local
name) as the first argument. This statement creates a feed element in the Atom namespace. This will be our
new document’s root element.
2. To add attributes to the newly created element, pass a dictionary of attribute names and values in the
attrib argument. Note that the attribute name should be in the standard ElementTree format,
{namespace}localname.
3. At any time, you can serialize any element (and its children) with the ElementTree tostring() function.
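If assembling the {namespace}localname string by hand feels error-prone, the standard library's QName helper builds it for you. A small sketch (the namespace constants are just local variable names chosen here):

```python
import xml.etree.ElementTree as etree

ATOM = 'http://www.w3.org/2005/Atom'
XML_NS = 'http://www.w3.org/XML/1998/namespace'

# QName(namespace, localname) renders as the '{namespace}localname' form
feed_tag = str(etree.QName(ATOM, 'feed'))
lang_attr = str(etree.QName(XML_NS, 'lang'))

new_feed = etree.Element(feed_tag, attrib={lang_attr: 'en'})
print(new_feed.tag)             # {http://www.w3.org/2005/Atom}feed
print(new_feed.get(lang_attr))  # en
```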
Was that serialization surprising to you? The way ElementTree serializes namespaced XML elements is
technically accurate but not optimal. The sample XML document at the beginning of this chapter defined a
default namespace (xmlns='http://www.w3.org/2005/Atom'). Defining a default namespace is useful for
documents — like Atom feeds — where every element is in the same namespace, because you can declare
the namespace once and declare each element with just its local name (<feed>, <link>, <entry>). There is
no need to use any prefixes unless you want to declare elements from another namespace.
An XML parser won’t “see” any difference between an XML document with a default namespace and an XML
document with a prefixed namespace. The resulting DOM of this serialization:
<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>
is identical to the DOM of this serialization:
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>
The only practical difference is that the second serialization is several characters shorter. If we were to
recast our entire sample feed with a ns0: prefix in every start and end tag, it would add 4 characters per
start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters.
Assuming UTF-8 encoding, that’s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be
downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
The built-in ElementTree library does not offer this fine-grained control over serializing namespaced
elements, but lxml does.
>>> import lxml.etree
>>> NSMAP = {None: 'http://www.w3.org/2005/Atom'}
①
>>> new_feed = lxml.etree.Element('feed', nsmap=NSMAP)
②
>>> print(lxml.etree.tounicode(new_feed))
③
<feed xmlns='http://www.w3.org/2005/Atom'/>
>>> new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')
④
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>
1. To start, define a namespace mapping as a dictionary. Dictionary values are namespace URIs; dictionary keys are
the desired prefixes. Using None as a prefix declares a default namespace.
2. Now you can pass the lxml-specific nsmap argument when you create an element, and lxml will respect the
namespace prefixes you’ve defined.
3. As expected, this serialization defines the Atom namespace as the default namespace and declares the feed
element without a namespace prefix.
4. Oops, we forgot to add the xml:lang attribute. You can always add attributes to any element with the
set() method. It takes two arguments: the attribute name in standard ElementTree format, then the
attribute value. (This method is not lxml-specific. The only lxml-specific part of this example was the nsmap
argument to control the namespace prefixes in the serialized output.)
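The nsmap argument can mix a default namespace with prefixed ones. A sketch; the Dublin Core namespace and the creator element are illustrative choices here, not part of the chapter's Atom example:

```python
import lxml.etree

ATOM = 'http://www.w3.org/2005/Atom'
DC = 'http://purl.org/dc/elements/1.1/'   # Dublin Core, a second namespace

# None -> default namespace; 'dc' -> an explicit prefix
NSMAP = {None: ATOM, 'dc': DC}
feed = lxml.etree.Element('feed', nsmap=NSMAP)
lxml.etree.SubElement(feed, '{%s}creator' % DC).text = 'Mark'

# The feed element gets no prefix; the creator element gets dc:
print(lxml.etree.tounicode(feed))
```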
Are XML documents limited to one element per document? No, of course not. You can easily create child
elements, too.
>>> title = lxml.etree.SubElement(new_feed, 'title',
①
...     attrib={'type':'html'})
②
>>> print(lxml.etree.tounicode(new_feed))
③
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'/></feed>
>>> title.text = 'dive into &hellip;'
④
>>> print(lxml.etree.tounicode(new_feed))
⑤
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'>dive into &amp;hellip;</title></feed>
>>> print(lxml.etree.tounicode(new_feed, pretty_print=True))
⑥
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title type='html'>dive into &amp;hellip;</title>
</feed>
1. To create a child element of an existing element, instantiate the SubElement class. The only required
arguments are the parent element (new_feed in this case) and the new element’s name. Since this child
element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or
prefix here.
2. You can also pass in an attribute dictionary. Keys are attribute names; values are attribute values.
3. As expected, the new title element was created in the Atom namespace, and it was inserted as a child of
the feed element. Since the title element has no text content and no children of its own, lxml serializes it
as an empty element (with the /> shortcut).
4. To set the text content of an element, simply set its .text property.
5. Now the title element is serialized with its text content. Any text content that contains less-than signs or
ampersands needs to be escaped when serialized. lxml handles this escaping automatically.
6. You can also apply “pretty printing” to the serialization, which inserts line breaks after end tags, and after
start tags of elements that contain child elements but no text content. In technical terms, lxml adds
“insignificant whitespace” to make the output more readable.
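Putting the pieces together, here is a sketch of a slightly fuller feed. The updated timestamp, title text, and link href are made-up values for illustration:

```python
import lxml.etree

NSMAP = {None: 'http://www.w3.org/2005/Atom'}
feed = lxml.etree.Element('feed', nsmap=NSMAP)
feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')

title = lxml.etree.SubElement(feed, 'title', attrib={'type': 'text'})
title.text = 'dive into mark'
lxml.etree.SubElement(feed, 'updated').text = '2009-03-27T21:56:07Z'

# Attributes but no text or children, so this serializes as an empty element
lxml.etree.SubElement(feed, 'link',
                      attrib={'rel': 'alternate', 'href': 'http://example.com/'})

print(lxml.etree.tounicode(feed, pretty_print=True))
```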
☞ You might also want to check out xmlwitch, another third-party library for generating XML. It makes extensive use of the with statement to make XML generation code
more readable.
⁂
12.8. PARSING BROKEN XML
The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is,
they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document.
Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters,
and a number of other esoteric rules. This is in stark contrast to other common formats like HTML — your
browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in
an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error
handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)
Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian
error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But
in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like
Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which
standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom
feeds on the web are plagued with wellformedness errors.
So, I have both theoretical and practical reasons to parse XML documents “at any cost,” that is, not to halt
and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.
Here is a fragment of a broken XML document. I’ve highlighted the wellformedness error.
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into &hellip;</title>
...
</feed>
That’s an error, because the &hellip; entity is not defined in XML. (It is defined in HTML.) If you try to
parse this broken feed with the default settings, lxml will choke on the undefined entity.
>>> import lxml.etree
>>> tree = lxml.etree.parse('examples/feed-broken.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2693, in lxml.etree.parse (src/lxml/lxml.etree.c:52591)
File "parser.pxi", line 1478, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:75665)
File "parser.pxi", line 1507, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:75993)
File "parser.pxi", line 1407, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:75002)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:72023)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:67830)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:68877)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:68125)
lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28
To parse this broken XML document, despite its wellformedness error, you need to create a custom XML
parser.
>>> parser = lxml.etree.XMLParser(recover=True)
①
>>> tree = lxml.etree.parse('examples/feed-broken.xml', parser)
②
>>> parser.error_log
③
examples/feed-broken.xml:3:28:FATAL:PARSER:ERR_UNDECLARED_ENTITY: Entity 'hellip' not defined
>>> tree.findall('{http://www.w3.org/2005/Atom}title')
[<Element {http://www.w3.org/2005/Atom}title at ead510>]
>>> title = tree.findall('{http://www.w3.org/2005/Atom}title')[0]
>>> title.text
④
'dive into '
>>> print(lxml.etree.tounicode(tree.getroot()))
⑤
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into </title>
.
. [rest of serialization snipped for brevity]
.
1. To create a custom parser, instantiate the lxml.etree.XMLParser class. It can take a number of different
named arguments. The one we’re interested in here is the recover argument. When set to True, the XML
parser will try its best to “recover” from wellformedness errors.
2. To parse an XML document with your custom parser, pass the parser object as the second argument to
the parse() function. Note that lxml does not raise an exception about the undefined &hellip; entity.
3. The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless
of whether it is set to recover from those errors or not.)
4. Since it didn’t know what to do with the undefined &hellip; entity, the parser just silently dropped it. The
text content of the title element becomes 'dive into '.
5. As you can see from the serialization, the &hellip; entity wasn’t converted into anything; it was simply dropped.
It is important to reiterate that there is no guarantee of interoperability with “recovering” XML parsers.
A different parser might decide that it recognized the &hellip; entity from HTML, and replace it with a real
ellipsis character (…) instead. Is that “better”? Maybe. Is it “more correct”? No, they are both equally incorrect.
The correct behavior (according to the XML specification) is to halt and catch fire. If you’ve decided not to
do that, you’re on your own.
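The recover behavior can also be demonstrated without a file by feeding the broken fragment to fromstring() with a custom parser. A sketch:

```python
import lxml.etree

broken = b"""<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into &hellip;</title>
</feed>"""

# The default parser halts and catches fire on the undefined entity
try:
    lxml.etree.fromstring(broken)
except lxml.etree.XMLSyntaxError as e:
    print('strict parser choked:', e)

# A recovering parser soldiers on and logs the error instead
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.fromstring(broken, parser)
title = root.find('{http://www.w3.org/2005/Atom}title')
print(repr(title.text))            # the entity is silently dropped
print(len(parser.error_log) > 0)   # but the error was recorded
```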
⁂
12.9. FURTHER READING
• XPath Support in ElementTree
• The ElementTree iterparse Function
• lxml
• Parsing XML and HTML with lxml
• xmlwitch
CHAPTER 13. SERIALIZING PYTHON OBJECTS
❝ Every Saturday since we’ve lived in this apartment, I have awakened at 6:15, poured myself a bowl of cereal, added a quarter-cup of 2% milk, sat on this end of this couch, turned on BBC America, and watched Doctor Who. ❞
— Sheldon, The Big Bang Theory
13.1. DIVING IN
On the surface, the concept of serialization is simple. You have a data structure in memory that you
want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you
want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save
your progress when you quit the game and pick up where you left off when you relaunch the game.
(Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your
progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The
data is only meant to be used by the same program that created it, never sent over a network, and never
read by anything other than the program that created it. Therefore, the interoperability issues are limited to
ensuring that later versions of the program can read data written by earlier versions.
For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always
available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily
complex Python data structures.
What can the pickle module store?
• All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.
• Lists, tuples, dictionaries, and sets containing any combination of native datatypes.
• Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing
any combination of native datatypes (and so on, to the maximum nesting level that Python supports).
• Functions, classes, and instances of classes (with caveats).
If this isn’t enough for you, the pickle module is also extensible. If you’re interested in extensibility, check
out the links in the Further Reading section at the end of the chapter.
13.1.1. A QUICK NOTE ABOUT THE EXAMPLES IN THIS CHAPTER
This chapter tells a tale with two Python Shells. All of the examples in this chapter are part of a single story
arc. You will be asked to switch back and forth between the two Python Shells as I demonstrate the pickle
and json modules.
To help keep things straight, open the Python Shell and define the following variable:
>>> shell = 1
Keep that window open. Now open another Python Shell and define the following variable:
>>> shell = 2
Throughout this chapter, I will use the shell variable to indicate which Python Shell is being used in each
example.
⁂
13.2. SAVING DATA TO A PICKLE FILE
The pickle module works with data structures. Let’s build one.
>>> shell
①
1
>>> entry = {}
②
>>> entry['title'] = 'Dive into history, 2009 edition'
>>> entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
>>> entry['comments_link'] = None
>>> entry['internal_id'] = b'\xDE\xD5\xB4\xF8'
>>> entry['tags'] = ('diveintopython', 'docbook', 'html')
>>> entry['published'] = True
>>> import time
>>> entry['published_date'] = time.strptime('Fri Mar 27 22:20:42 2009')
③
>>> entry['published_date']
time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1)
1. Follow along in Python Shell #1.
2. The idea here is to build a Python dictionary that could represent something useful, like an entry in an Atom
feed. But I also want to ensure that it contains several different types of data, to show off the pickle module. Don’t read too much into these values.
3. The time module contains a data structure (struct_time) to represent a point in time (accurate to one
second) and functions to manipulate time structs. The strptime() function takes a formatted string and
converts it to a struct_time. This string is in the default format, but you can control that with format
codes. See the time module for more details.
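Format codes let you parse timestamps that are not in the default format. A quick sketch, using the same date as the example above:

```python
import time

# Parse a timestamp with explicit format codes instead of the default format
t = time.strptime('2009-03-27 22:20:42', '%Y-%m-%d %H:%M:%S')
print(t.tm_year, t.tm_mon, t.tm_mday)   # 2009 3 27

# strptime derives the weekday (4 == Friday) and day-of-year for you
print(t.tm_wday, t.tm_yday)             # 4 86

# strftime goes the other way, from a struct_time back to a string
print(time.strftime('%Y-%m-%dT%H:%M:%SZ', t))  # 2009-03-27T22:20:42Z
```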
That’s a handsome-looking Python dictionary. Let’s save it to a file.
>>> shell
①
1
>>> import pickle
>>> with open('entry.pickle', 'wb') as f:
②
...     pickle.dump(entry, f)
③
...
1. This is still in Python Shell #1.
2. Use the open() function to open a file. Set the file mode to 'wb' to open the file for writing in binary
mode. Wrap it in a with statement to ensure the file is closed automatically when you’re done with it.
3. The dump() function in the pickle module takes a serializable Python data structure, serializes it into a
binary, Python-specific format using the latest version of the pickle protocol, and saves it to an open file.
That last sentence was pretty important.
• The pickle module takes a Python data structure and saves it to a file.
• To do this, it serializes the data structure using a data format called “the pickle protocol.”
• The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility. You probably
couldn’t take the entry.pickle file you just created and do anything useful with it in Perl, PHP, Java, or any
other language.
• Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed
several times as new data types have been added to the Python language, but there are still limitations.
• As a result of these changes, there is no guarantee of compatibility between different versions of Python
itself. Newer versions of Python support the older serialization formats, but older versions of Python do not
support newer formats (since they don’t support the newer data types).
• Unless you specify otherwise, the functions in the pickle module will use the latest version of the pickle
protocol. This ensures that you have maximum flexibility in the types of data you can serialize, but it also
means that the resulting file will not be readable by older versions of Python that do not support the latest
version of the pickle protocol.
• The latest version of the pickle protocol is a binary format. Be sure to open your pickle files in binary mode, or the data will get corrupted during writing.
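The protocol version is under your control. A sketch, assuming any recent Python 3 (the default protocol number varies by Python version):

```python
import pickle

entry = {'title': 'Dive into history, 2009 edition', 'published': True}

# The default protocol may be lower than the highest one available
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Passing an explicit lower protocol trades newer features for
# compatibility with older versions of Python
old_blob = pickle.dumps(entry, protocol=2)
print(old_blob[:2])   # b'\x80\x02' -- the PROTO opcode, version 2

# The same interpreter can read back any protocol it can write
print(pickle.loads(old_blob) == entry)
```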
⁂
13.3. LOADING DATA FROM A PICKLE FILE
Now switch to your second Python Shell — i.e. not the one where you created the entry dictionary.
>>> shell
①
2
>>> entry
②
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'entry' is not defined
>>> import pickle
>>> with open('entry.pickle', 'rb') as f:
③
...     entry = pickle.load(f)
④
...
>>> entry
⑤
{'comments_link': None,
'internal_id': b'\xDE\xD5\xB4\xF8',
'title': 'Dive into history, 2009 edition',
'tags': ('diveintopython', 'docbook', 'html'),
'article_link':
'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}
1. This is Python Shell #2.
2. There is no entry variable defined here. You defined an entry variable in Python Shell #1, but that’s a
completely different environment with its own state.
3. Open the entry.pickle file you created in Python Shell #1. The pickle module uses a binary data format,
so you should always open pickle files in binary mode.
4. The pickle.load() function takes a stream object, reads the serialized data from the stream, reconstructs the original data structure as a new Python object, and returns that new object.
5. Now the entry variable is a dictionary with familiar-looking keys and values.
The pickle.dump() / pickle.load() cycle results in a new data structure that is equal to the original data
structure.
>>> shell
①
1
>>> with open('entry.pickle', 'rb') as f:
②
...     entry2 = pickle.load(f)
③
...
>>> entry2 == entry
④
True
>>> entry2 is entry
⑤
False
>>> entry2['tags']
⑥
('diveintopython', 'docbook', 'html')
>>> entry2['internal_id']
b'\xDE\xD5\xB4\xF8'
1. Switch back to Python Shell #1.
2. Open the entry.pickle file.
3. Load the serialized data into a new variable, entry2.
4. Python confirms that the two dictionaries, entry and entry2, are equal. In this shell, you built entry from
the ground up, starting with an empty dictionary and manually assigning values to specific keys. You serialized
this dictionary and stored it in the entry.pickle file. Now you’ve read the serialized data from that file and
created a perfect replica of the original data structure.
5. Equality is not the same as identity. I said you’ve created a perfect replica of the original data structure, which is true. But it’s still a copy.
6. For reasons that will become clear later in this chapter, I want to point out that the value of the 'tags'
key is a tuple, and the value of the 'internal_id' key is a bytes object.
⁂
13.4. PICKLING WITHOUT A FILE
The examples in the previous section showed how to serialize a Python object directly to a file on disk. But
what if you don’t want or need a file? You can also serialize to a bytes object in memory.
>>> shell
1
>>> b = pickle.dumps(entry)
①
>>> type(b)
②
<class 'bytes'>
>>> entry3 = pickle.loads(b)
③
>>> entry3 == entry
④
True
1. The pickle.dumps() function (note the 's' at the end of the function name) performs the same
serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data
to a file on disk, it simply returns the serialized data.
2. Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.
3. The pickle.loads() function (again, note the 's' at the end of the function name) performs the same
deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized
data from a file, it takes a bytes object containing serialized data, such as the one returned by the
pickle.dumps() function.
4. The end result is the same: a perfect replica of the original dictionary.
⁂
13.5. BYTES AND STRINGS REAR THEIR UGLY HEADS AGAIN
The pickle protocol has been around for many years, and it has matured as Python itself has matured. There
are now four different versions of the pickle protocol.
• Python 1.x had two pickle protocols, a text-based format (“version 0”) and a binary format (“version 1”).
• Python 2.3 introduced a new pickle protocol (“version 2”) to handle new functionality in Python class
objects. It is a binary format.
• Python 3.0 introduced another pickle protocol (“version 3”) with explicit support for bytes objects and byte
arrays. It is a binary format.
Oh look, the difference between bytes and strings rears its ugly head again. (If you’re surprised, you haven’t been paying attention.) What this means in practice is that, while Python 3 can read data pickled with
protocol version 2, Python 2 cannot read data pickled with protocol version 3.
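You can see the version boundary directly in the pickled bytes. A sketch; the PROTO opcode detail is the same one that shows up in pickletools disassembly:

```python
import pickle

blob = pickle.dumps(b'\xDE\xD5\xB4\xF8', protocol=3)

# Binary pickles begin with the PROTO opcode (0x80) followed by the version
print(hex(blob[0]), blob[1])   # 0x80 3

# Protocol 2 predates bytes objects, but Python 3 can still pickle them
# with it (via a compatibility encoding) and read them back intact
blob2 = pickle.dumps(b'\xDE\xD5\xB4\xF8', protocol=2)
print(pickle.loads(blob2) == b'\xDE\xD5\xB4\xF8')
```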
⁂
13.6. DEBUGGING PICKLE FILES
What does the pickle protocol look like? Let’s jump out of the Python Shell for a moment and take a look
at that entry.pickle file we created.
you@localhost:~/diveintopython3/examples$ ls -l entry.pickle
-rw-r--r-- 1 you you 358 Aug 3 13:34 entry.pickle
you@localhost:~/diveintopython3/examples$ cat entry.pickle
comments_linkqNXtagsqXdiveintopythonqXdocbookqXhtmlq?qX publishedq?
XlinkXJhttp://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition
q Xpublished_dateq
ctime
struct_time
?qRqXtitleqXDive into history, 2009 editionqu.
That wasn’t terribly helpful. You can see the strings, but other datatypes end up as unprintable (or at least
unreadable) characters. Fields are not obviously delimited by tabs or spaces. This is not a format you would
want to debug by yourself.
>>> shell
1
>>> import pickletools
>>> with open('entry.pickle', 'rb') as f:
...     pickletools.dis(f)
0: \x80 PROTO 3
2: } EMPTY_DICT
3: q BINPUT 0
5: ( MARK
6: X BINUNICODE 'published_date'
25: q BINPUT 1
27: c GLOBAL 'time struct_time'
45: q BINPUT 2
47: ( MARK
48: M BININT2 2009
51: K BININT1 3
53: K BININT1 27
55: K BININT1 22
57: K BININT1 20
59: K BININT1 42
61: K BININT1 4
63: K BININT1 86
65: J BININT -1
70: t TUPLE (MARK at 47)
71: q BINPUT 3
73: } EMPTY_DICT
74: q BINPUT 4
76: \x86 TUPLE2
77: q BINPUT 5
79: R REDUCE
80: q BINPUT 6
82: X BINUNICODE 'comments_link'
100: q BINPUT 7
102: N NONE