from_roman should fail with too many repeated numerals ... ok

from_roman should give known result with known input ... ok

to_roman should give known result with known input ... ok

from_roman(to_roman(n))==n for all n ... ok

to_roman should fail with negative input ... ok

to_roman should fail with non-integer input ... ok

to_roman should fail with large input ... ok

to_roman should fail with 0 input ... ok

----------------------------------------------------------------------

Ran 12 tests in 0.031s

OK

1. Not that you asked, but it’s fast, too! Like, almost 10× as fast. Of course, it’s not entirely a fair comparison,

because this version takes longer to import (when it builds the lookup tables). But since the import is only

done once, the startup cost is amortized over all the calls to the to_roman() and from_roman() functions.

Since the tests make several thousand function calls (the roundtrip test alone makes 10,000), this savings

adds up in a hurry!

The moral of the story?

• Simplicity is a virtue.

• Especially when regular expressions are involved.

• Unit tests can give you the confidence to do large-scale refactoring.

10.4. SUMMARY

Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and

increase flexibility in any long-term project. It is also important to understand that unit testing is not a

panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to

date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a

replacement for other forms of testing, including functional testing, integration testing, and user acceptance

testing. But it is feasible, and it does work, and once you’ve seen it work, you’ll wonder how you ever got

along without it.

These few chapters have covered a lot of ground, and much of it wasn’t even Python-specific. There are unit

testing frameworks for many languages, all of which require you to understand the same basic concepts:

• Designing test cases that are specific, automated, and independent

• Writing test cases before the code they are testing

• Writing tests that test good input and check for proper results

• Writing tests that test bad input and check for proper failure responses (both patterns are sketched after this list)

• Writing and updating test cases to reflect new requirements

• Refactoring mercilessly to improve performance, scalability, readability, maintainability, or whatever other -ility

you’re lacking
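For instance, the pairing of good-input and bad-input tests from the list above might look like this minimal unittest sketch. (It assumes the roman.py module and OutOfRangeError exception built earlier in this book; the test names here are illustrative, not the full suite.)

import unittest
import roman

class ToRomanGoodInput(unittest.TestCase):
    def test_known_value(self):
        '''to_roman should give known result with known input'''
        self.assertEqual('MMMDCCCLXXXVIII', roman.to_roman(3888))

class ToRomanBadInput(unittest.TestCase):
    def test_zero(self):
        '''to_roman should fail with 0 input'''
        self.assertRaises(roman.OutOfRangeError, roman.to_roman, 0)

if __name__ == '__main__':
    unittest.main()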

CHAPTER 11. FILES

A nine mile walk is no joke, especially in the rain.

— Harry Kemelman, The Nine Mile Walk

11.1. DIVING IN

My Windows laptop had 38,493 files before I installed a single application. Installing Python 3 added

almost 3,000 files to that total. Files are the primary storage paradigm of every major operating system; the

concept is so ingrained that most people would have trouble imagining an alternative. Your computer is, metaphorically speaking, drowning in files.

11.2. READING FROM TEXT FILES

Before you can read from a file, you need to open it. Opening a file in Python couldn’t be easier:

a_file = open('examples/chinese.txt', encoding='utf-8')

Python has a built-in open() function, which takes a filename as an argument. Here the filename is

'examples/chinese.txt'. There are five interesting things about this filename:

1. It’s not just the name of a file; it’s a combination of a directory path and a filename. A hypothetical file-

opening function could have taken two arguments — a directory path and a filename — but the open()

function only takes one. In Python, whenever you need a “filename,” you can include some or all of a

directory path as well.

2. The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses

backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python,

forward slashes always Just Work, even on Windows (there’s a small demonstration after this list).

3. The directory path does not begin with a slash or a drive letter, so it is called a relative path. Relative to

what, you might ask? Patience, grasshopper.

4. It’s a string. All modern operating systems (even Windows!) use Unicode to store the names of files and

directories. Python 3 fully supports non-ASCII pathnames.

5. It doesn’t need to be on your local disk. You might have a network drive mounted. That “file” might be a

figment of an entirely virtual filesystem. If your computer considers it a file and can access it as a file, Python can open it.
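Here is the demonstration promised in item 2 (a sketch of my own using the standard ntpath and posixpath modules, which implement the Windows and POSIX path rules respectively):

import ntpath
import posixpath

# The same forward-slash path is understood by both families of path rules.
path = 'examples/chinese.txt'
print(posixpath.normpath(path))   # examples/chinese.txt
print(ntpath.normpath(path))      # examples\chinese.txt (the Windows spelling)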

But that call to the open() function didn’t stop at the filename. There’s another argument, called encoding.

Oh dear, that sounds dreadfully familiar.

11.2.1. CHARACTER ENCODING REARS ITS UGLY HEAD

Bytes are bytes; characters are an abstraction. A string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text

file” from disk, how does Python convert that sequence of bytes into a sequence of characters? It decodes

the bytes according to a specific character encoding algorithm and returns a sequence of Unicode characters

(otherwise known as a string).

# This example was created on Windows. Other platforms may

# behave differently, for reasons outlined below.

>>> file = open('examples/chinese.txt')

>>> a_string = file.read()

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode

return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 28: character maps to <undefined>

>>>

What just happened? You didn’t specify a character encoding, so Python is forced to use the default encoding. What’s the default encoding? If you look closely at the traceback, you can see that it’s dying in cp1252.py, meaning that Python is using CP-1252 as the default encoding here. (CP-1252 is a common encoding on computers running Microsoft Windows.) The CP-1252 character set doesn’t support the characters that are in this file, so the read fails with an ugly UnicodeDecodeError.

But wait, it’s worse than that! The default encoding is

platform-dependent, so this code might work on your

computer (if your default encoding is UTF-8), but then

it will fail when you distribute it to someone else

(whose default encoding is different, like CP-1252).

☞ If you need to get the default character encoding, import the locale module and call

locale.getpreferredencoding(). On my Windows laptop, it returns 'cp1252', but

on my Linux box upstairs, it returns 'UTF8'. I can’t even maintain consistency in my

own house! Your results may be different (even on Windows) depending on which

version of your operating system you have installed and how your regional/language

settings are configured. This is why it’s so important to specify the encoding every

time you open a file.
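You can check your own platform’s default right now; locale.getpreferredencoding() is a real standard-library call, and the value shown below is just one possible result:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'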

11.2.2. STREAM OBJECTS

So far, all we know is that Python has a built-in function called open(). The open() function returns a

stream object, which has methods and attributes for getting information about and manipulating a stream of

characters.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')

>>> a_file.name

'examples/chinese.txt'

>>> a_file.encoding

'utf-8'

>>> a_file.mode

'r'

1. The name attribute reflects the name you passed in to the open() function when you opened the file. It is

not normalized to an absolute pathname.

2. Likewise, the encoding attribute reflects the encoding you passed in to the open() function. If you didn’t specify

the encoding when you opened the file (bad developer!) then the encoding attribute will reflect

locale.getpreferredencoding().

3. The mode attribute tells you in which mode the file was opened. You can pass an optional mode parameter

to the open() function. You didn’t specify a mode when you opened this file, so Python defaults to 'r',

which means “open for reading only, in text mode.” As you’ll see later in this chapter, the file mode serves

several purposes; different modes let you write to a file, append to a file, or open a file in binary mode (in

which you deal with bytes instead of strings).

☞ The documentation for the open() function lists all the possible file modes.

11.2.3. READING DATA FROM A TEXT FILE

After you open a file for reading, you’ll probably want to read from it at some point.

>>> a_file = open('examples/chinese.txt', encoding='utf-8')

>>> a_file.read()

'Dive Into Python 是为有经验的程序员编写的一本 Python 书。\n'

>>> a_file.read()

''

1. Once you open a file (with the correct encoding), reading from it is just a matter of calling the stream

object’s read() method. The result is a string.

2. Perhaps somewhat surprisingly, reading the file again does not raise an exception. Python does not consider

reading past end-of-file to be an error; it simply returns an empty string.

What if you want to re-read a file?

# continued from the previous example
>>> a_file.read()
''
>>> a_file.seek(0)
0
>>> a_file.read(16)
'Dive Into Python'
>>> a_file.read(1)
' '
>>> a_file.read(1)
'是'
>>> a_file.tell()
20

1. Since you’re still at the end of the file, further calls to

the stream object’s read() method simply return an

empty string.

2. The seek() method moves to a specific byte position in a file.

3. The read() method can take an optional parameter, the number of characters to read.

4. If you like, you can even read one character at a time.

5. 16 + 1 + 1 = … 20?

Let’s try that again.

# continued from the previous example

>>> a_file.seek(17)

17

>>> a_file.read(1)

'是'

>>> a_file.tell()

20

1. Move to the 17th byte.

2. Read one character.

3. Now you’re on the 20th byte.

Do you see it yet? The seek() and tell() methods always count bytes, but since you opened this file as

text, the read() method counts characters. Chinese characters require multiple bytes to encode in UTF-8.

The English characters in the file only require one byte each, so you might be misled into thinking that the

seek() and read() methods are counting the same thing. But that’s only true for some characters.
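You can verify the byte arithmetic for yourself (this quick check is mine, but it uses only the built-in encode() method):

>>> len('Dive Into Python'.encode('utf-8'))
16
>>> len(' '.encode('utf-8'))
1
>>> len('是'.encode('utf-8'))
3
>>> 16 + 1 + 3
20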

But wait, it gets worse!

>>> a_file.seek(18)

18

>>> a_file.read(1)

Traceback (most recent call last):

File "<pyshell#12>", line 1, in <module>

a_file.read(1)

File "C:\Python31\lib\codecs.py", line 300, in decode

(result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x98 in position 0: unexpected code byte

1. Move to the 18th byte and try to read one character.

2. Why does this fail? Because there isn’t a character at the 18th byte. The nearest character starts at the 17th

byte (and goes for three bytes). Trying to read a character from the middle will fail with a

UnicodeDecodeError.

11.2.4. CLOSING FILES

Open files consume system resources, and depending on the file mode, other programs may not be able to

access them. It’s important to close files as soon as you’re finished with them.

# continued from the previous example

>>> a_file.close()

Well that was anticlimactic.

The stream object a_file still exists; calling its close() method doesn’t destroy the object itself. But it’s

not terribly useful.

# continued from the previous example

>>> a_file.read()

Traceback (most recent call last):

File "<pyshell#24>", line 1, in <module>

a_file.read()

ValueError: I/O operation on closed file.

>>> a_file.seek(0)

Traceback (most recent call last):

File "<pyshell#25>", line 1, in <module>

a_file.seek(0)

ValueError: I/O operation on closed file.

>>> a_file.tell()

Traceback (most recent call last):

File "<pyshell#26>", line 1, in <module>

a_file.tell()

ValueError: I/O operation on closed file.

>>> a_file.close()

>>> a_file.closed

True

1. You can’t read from a closed file; that raises a ValueError exception.

2. You can’t seek in a closed file either.

3. There’s no current position in a closed file, so the tell() method also fails.

4. Perhaps surprisingly, calling the close() method on a stream object whose file has been closed does not

raise an exception. It’s just a no-op.

5. Closed stream objects do have one useful attribute: the closed attribute will confirm that the file is closed.

11.2.5. CLOSING FILES AUTOMATICALLY

Stream objects have an explicit close() method, but what happens if your code has a bug and crashes before you call close()? That file could theoretically stay open for much longer than necessary. While you’re debugging on your local computer, that’s not a big deal. On a production server, maybe it is.

Python 2 had a solution for this: the try..finally block. That still works in Python 3, and you may see it in other people’s code or in older code that was ported to Python 3. But Python 2.6 introduced a cleaner solution, which is now the preferred solution in Python 3: the with statement. try..finally is good; with is better.

with open('examples/chinese.txt', encoding='utf-8') as a_file:
    a_file.seek(17)
    a_character = a_file.read(1)
    print(a_character)

This code calls open(), but it never calls a_file.close(). The with statement starts a code block, like an

if statement or a for loop. Inside this code block, you can use the variable a_file as the stream object

returned from the call to open(). All the regular stream object methods are available — seek(), read(),

whatever you need. When the with block ends, Python calls a_file.close() automatically.
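For comparison, here is what the older try..finally idiom mentioned above looks like for the same task (a sketch; the behavior is identical, the ceremony is just more visible):

a_file = open('examples/chinese.txt', encoding='utf-8')
try:
    a_file.seek(17)
    a_character = a_file.read(1)
    print(a_character)
finally:
    a_file.close()   # runs no matter how the try block exits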

Here’s the kicker: no matter how or when you exit the with block, Python will close that file… even if you

“exit” it via an unhandled exception. That’s right, even if your code raises an exception and your entire

program comes to a screeching halt, that file will get closed. Guaranteed.

☞ In technical terms, the with statement creates a runtime context. In these examples,

the stream object acts as a context manager. Python creates the stream object

a_file and tells it that it is entering a runtime context. When the with code block

is completed, Python tells the stream object that it is exiting the runtime context,

and the stream object calls its own close() method. See Appendix B, “Classes That

Can Be Used in a with Block” for details.

There’s nothing file-specific about the with statement; it’s just a generic framework for creating runtime

contexts and telling objects that they’re entering and exiting a runtime context. If the object in question is a

stream object, then it does useful file-like things (like closing the file automatically). But that behavior is

defined in the stream object, not in the with statement. There are lots of other ways to use context

managers that have nothing to do with files. You can even create your own, as you’ll see later in this

chapter.

11.2.6. READING DATA ONE LINE AT A TIME

A “line” of a text file is just what you think it is — you type a few words and press ENTER, and now you’re

on a new line. A line of text is a sequence of characters delimited by… what exactly? Well, it’s complicated,

because text files can use several different characters to mark the end of a line. Every operating system has

its own convention. Some use a carriage return character, others use a line feed character, and some use

both characters at the end of every line.

Now breathe a sigh of relief, because Python handles line endings automatically by default. If you say, “I want to

read this text file one line at a time,” Python will figure out which kind of line ending the text file uses and

it will all Just Work.

☞ If you need fine-grained control over what’s considered a line ending, you can pass the optional newline parameter to the open() function. See the open() function documentation for all the gory details.
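For instance (my sketch, not the chapter’s example), passing newline='' turns the translation off so you can see the raw line endings:

with open('examples/favorite-people.txt', encoding='utf-8', newline='') as a_file:
    first_line = a_file.readline()
print(repr(first_line))   # ends in '\n', '\r\n', or '\r', exactly as stored on disk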

So, how do you actually do it? Read a file one line at a time, that is. It’s so simple, it’s beautiful.

line_number = 0
with open('examples/favorite-people.txt', encoding='utf-8') as a_file:
    for a_line in a_file:
        line_number += 1
        print('{:>4} {}'.format(line_number, a_line.rstrip()))

1. Using the with pattern, you safely open the file and let Python close it for you.

2. To read a file one line at a time, use a for loop. That’s it. Besides having explicit methods like read(), the

stream object is also an iterator which spits out a single line every time you ask for a value.

3. Using the format() string method, you can print out the line number and the line itself. The format specifier

{:>4} means “print this argument right-justified within 4 spaces.” The a_line variable contains the complete

line, carriage returns and all. The rstrip() string method removes the trailing whitespace, including the

carriage return characters.

you@localhost:~/diveintopython3$ python3 examples/oneline.py

1 Dora

2 Ethan

3 Wesley

4 John

5 Anne

6 Mike

7 Chris

8 Sarah

9 Alex

10 Lizzie

Did you get this error?

you@localhost:~/diveintopython3$ python3 examples/oneline.py

Traceback (most recent call last):

File "examples/oneline.py", line 4, in <module>

print('{:>4} {}'.format(line_number, a_line.rstrip()))

ValueError: zero length field name in format

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.

Python 3.0 supported string formatting, but only with explicitly numbered format specifiers.

Python 3.1 allows you to omit the argument indexes in your format specifiers. Here is the

Python 3.0-compatible version for comparison:

print('{0:>4} {1}'.format(line_number, a_line.rstrip()))

11.3. WRITING TO TEXT FILES

You can write to files in much the same way that you read from them. First you open a file and get a stream object, then you use methods on the stream object to write data to the file, then you close the file.

To open a file for writing, use the open() function and specify the write mode. There are two file modes for writing:

• “Write” mode will overwrite the file. Pass mode='w' to the open() function.

• “Append” mode will add data to the end of the file. Pass mode='a' to the open() function.

Either mode will create the file automatically if it doesn’t already exist, so there’s never a need for any sort of fiddly “if the file doesn’t exist yet, create a new empty file just so you can open it for the first time” function. Just open a file and start writing.

You should always close a file as soon as you’re done

writing to it, to release the file handle and ensure that

the data is actually written to disk. As with reading data from a file, you can call the stream object’s close()

method, or you can use the with statement and let Python close the file for you. I bet you can guess which

technique I recommend.

>>> with open('test.log', mode='w', encoding='utf-8') as a_file:
...     a_file.write('test succeeded')
>>> with open('test.log', encoding='utf-8') as a_file:
...     print(a_file.read())
test succeeded
>>> with open('test.log', mode='a', encoding='utf-8') as a_file:
...     a_file.write('and again')
>>> with open('test.log', encoding='utf-8') as a_file:
...     print(a_file.read())
test succeededand again

1. You start boldly by creating the new file test.log (or overwriting the existing file), and opening the file for

writing. The mode='w' parameter means open the file for writing. Yes, that’s all as dangerous as it sounds. I

hope you didn’t care about the previous contents of that file (if any), because that data is gone now.

2. You can add data to the newly opened file with the write() method of the stream object returned by the

open() function. After the with block ends, Python automatically closes the file.

3. That was so fun, let’s do it again. But this time, with mode='a' to append to the file instead of overwriting

it. Appending will never harm the existing contents of the file.

4. Both the original line you wrote and the second line you appended are now in the file test.log. Also note

that neither carriage returns nor line feeds are included. Since you didn’t write them explicitly to the file

either time, the file doesn’t include them. You can write a carriage return with the '\r' character, and/or a

line feed with the '\n' character. Since you didn’t do either, everything you wrote to the file ended up on

one line.

11.3.1. CHARACTER ENCODING AGAIN

Did you notice the encoding parameter that got passed in to the open() function while you were opening a

file for writing? It’s important; don’t ever leave it out! As you saw in the beginning of this chapter, files don’t contain strings, they contain bytes. Reading a “string” from a text file only works because you told Python

what encoding to use to read a stream of bytes and convert it to a string. Writing text to a file presents the

same problem in reverse. You can’t write characters to a file; characters are an abstraction. In order to write to the file, Python needs to know how to convert your string into a sequence of bytes. The only way

to be sure it’s performing the correct conversion is to specify the encoding parameter when you open the

file for writing.
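You can watch that conversion happen by hand (my demonstration; encode() is the built-in string method doing the work behind the scenes):

>>> '是'.encode('utf-8')
b'\xe6\x98\xaf'
>>> '是'.encode('cp1252')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u662f' in position 0: character maps to <undefined>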

11.4. BINARY FILES

Not all files contain text. Some of them contain pictures of my dog.

>>> an_image = open('examples/beauregard.jpg', mode='rb')

>>> an_image.mode

'rb'

>>> an_image.name

'examples/beauregard.jpg'

>>> an_image.encoding

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

AttributeError: '_io.BufferedReader' object has no attribute 'encoding'

1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that

the mode parameter contains a 'b' character.

2. The stream object you get from opening a file in binary mode has many of the same attributes, including

mode, which reflects the mode parameter you passed into the open() function.

3. Binary stream objects also have a name attribute, just like text stream objects.

4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right?

You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do. What you get out

of a binary file is exactly what you put into it, no conversion necessary.

Did I mention you’re reading bytes? Oh yes you are.

# continued from the previous example

>>> an_image.tell()

0

>>> data = an_image.read(3)

>>> data

b'\xff\xd8\xff'

>>> type(data)

<class 'bytes'>

>>> an_image.tell()

3

>>> an_image.seek(0)

0

>>> data = an_image.read()

>>> len(data)

3150

1. Like text files, you can read binary files a little bit at a time. But there’s a crucial difference…

2. …you’re reading bytes, not strings. Since you opened the file in binary mode, the read() method takes the

number of bytes to read, not the number of characters.

3. That means that there’s never an unexpected mismatch between the number you passed into the read() method and the position index you get out of the tell() method. The read() method reads bytes, and the

seek() and tell() methods track the number of bytes read. For binary files, they’ll always agree.

11.5. STREAM OBJECTS FROM NON-FILE SOURCES

Imagine you’re writing a library, and one of your library functions is going to read some data from a file. The function could simply take a filename as a string, go open the file for reading, read it, and close it before exiting. But you shouldn’t do that. Instead, your API should take an arbitrary stream object. To read from a fake file, just call read().

In the simplest case, a stream object is anything with a read() method which takes an optional size parameter and returns a string. When called with no size parameter, the read() method should read everything there is to read from the input source and return all the data as a single value. When called with a size parameter, it reads that much from the input source and returns that much data. When called again, it picks up where it left off and returns the next chunk of data.

That sounds exactly like the stream object you get from opening a real file. The difference is that you’re not

limiting yourself to real files. The input source that’s being “read” could be anything: a web page, a string in

memory, even the output of another program. As long as your functions take a stream object and simply call

the object’s read() method, you can handle any input source that acts like a file, without specific code to

handle each kind of input.

>>> a_string = 'PapayaWhip is the new black.'

>>> import io

>>> a_file = io.StringIO(a_string)

>>> a_file.read()

'PapayaWhip is the new black.'

>>> a_file.read()

''

>>> a_file.seek(0)

0

>>> a_file.read(10)

'PapayaWhip'

>>> a_file.tell()

10

>>> a_file.seek(18)

18

>>> a_file.read()

'new black.'

1. The io module defines the StringIO class that you can use to treat a string in memory as a file.

2. To create a stream object out of a string, create an instance of the io.StringIO() class and pass it the

string you want to use as your “file” data. Now you have a stream object, and you can do all sorts of

stream-like things with it.

3. Calling the read() method “reads” the entire “file,” which in the case of a StringIO object simply returns

the original string.

4. Just like a real file, calling the read() method again returns an empty string.

5. You can explicitly seek to the beginning of the string, just like seeking through a real file, by using the

seek() method of the StringIO object.

6. You can also read the string in chunks, by passing a size parameter to the read() method.

☞ io.StringIO lets you treat a string as a text file. There’s also an io.BytesIO class, which lets you treat a byte array as a binary file.
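A quick taste of io.BytesIO, mirroring the StringIO example above (the three bytes here are my arbitrary sample):

>>> import io
>>> a_binary_file = io.BytesIO(b'\xff\xd8\xff')
>>> a_binary_file.read(2)
b'\xff\xd8'
>>> a_binary_file.read()
b'\xff'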

11.5.1. HANDLING COMPRESSED FILES

The Python standard library contains modules that support reading and writing compressed files. There are a

number of different compression schemes; the two most popular on non-Windows systems are gzip and

bzip2. (You may have also encountered PKZIP archives and GNU Tar archives. Python has modules for those, too.)

The gzip module lets you create a stream object for reading or writing a gzip-compressed file. The stream

object it gives you supports the read() method (if you opened it for reading) or the write() method (if

you opened it for writing). That means you can use the methods you’ve already learned for regular files to

directly read or write a gzip-compressed file, without creating a temporary file to store the decompressed data.

As an added bonus, it supports the with statement too, so you can let Python automatically close your gzip-

compressed file when you’re done with it.

you@localhost:~$ python3
>>> import gzip
>>> with gzip.open('out.log.gz', mode='wb') as z_file:
...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
...
>>> exit()
you@localhost:~$ ls -l out.log.gz
-rw-r--r-- 1 mark mark 79 2009-07-19 14:29 out.log.gz
you@localhost:~$ gunzip out.log.gz
you@localhost:~$ cat out.log
A nine mile walk is no joke, especially in the rain.

1. You should always open gzipped files in binary mode. (Note the 'b' character in the mode argument.)

2. I constructed this example on Linux. If you’re not familiar with the command line, this command is showing

the “long listing” of the gzip-compressed file you just created in the Python Shell. This listing shows that the

file exists (good), and that it is 79 bytes long. That’s actually larger than the string you started with! The gzip

file format includes a fixed-length header that contains some metadata about the file, so it’s inefficient for

extremely small files.

3. The gunzip command (pronounced “gee-unzip”) decompresses the file and stores the contents in a new file

named the same as the compressed file but without the .gz file extension.

4. The cat command displays the contents of a file. This file contains the string you originally wrote directly to

the compressed file out.log.gz from within the Python Shell.
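Reading a gzip-compressed file back from Python works the same way (my sketch; run it before the gunzip step above, while out.log.gz still exists):

>>> import gzip
>>> with gzip.open('out.log.gz', mode='rb') as z_file:
...     print(z_file.read().decode('utf-8'))
...
A nine mile walk is no joke, especially in the rain.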

Did you get this error?

>>> with gzip.open('out.log.gz', mode='wb') as z_file:
...     z_file.write('A nine mile walk is no joke, especially in the rain.'.encode('utf-8'))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GzipFile' object has no attribute '__exit__'

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.

Python 3.0 had a gzip module, but it did not support using a gzipped-file object as a

context manager. Python 3.1 added the ability to use gzipped-file objects in a with

statement.

11.6. STANDARD INPUT, OUTPUT, AND ERROR

Command-line gurus are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.

Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX-like system, including Mac OS X and Linux. When you call the print() function, the thing you’re printing is sent to the stdout pipe. When your program crashes and prints out a traceback, it goes to the stderr pipe. By default, both of these pipes are just connected to the terminal window where you are working; when your program prints something, you see the output in your terminal window, and when a program crashes, you see the traceback in your terminal window too. In the graphical Python Shell, the stdout and stderr pipes default to your “Interactive Window”.

>>> for i in range(3):
...     print('PapayaWhip')
PapayaWhip
PapayaWhip
PapayaWhip
>>> import sys
>>> for i in range(3):
...     sys.stdout.write('is the')
is theis theis the
>>> for i in range(3):
...     sys.stderr.write('new black')
new blacknew blacknew black

1. The print() function, in a loop. Nothing surprising here.

2. stdout is defined in the sys module, and it is a stream object. Calling its write() function will print out whatever string you give it. In fact, this is what the print() function really does; it adds a newline to the end of the string you’re printing, and calls sys.stdout.write.

3. In the simplest case, sys.stdout and sys.stderr send their output to the same place: the Python IDE (if you’re in one), or the terminal (if you’re running Python from the command line). Like standard output, standard error does not add newlines for you. If you want newlines, you’ll need to write newline characters.

sys.stdout and sys.stderr are stream objects, but they are write-only. Attempting to call their read()

method will always raise an IOError.

>>> import sys

>>> sys.stdout.read()

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

IOError: not readable

11.6.1. REDIRECTING STANDARD OUTPUT

sys.stdout and sys.stderr are stream objects, albeit ones that only support writing. But they’re not

constants; they’re variables. That means you can assign them a new value — any other stream object — to

redirect their output.

import sys

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')

Check this out:

you@localhost:~/diveintopython3/examples$ python3 stdout.py

A

C

you@localhost:~/diveintopython3/examples$ cat out.log

B

Did you get this error?

you@localhost:~/diveintopython3/examples$ python3 stdout.py

File "stdout.py", line 15

with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):

^

SyntaxError: invalid syntax

If so, you’re probably using Python 3.0. You should really upgrade to Python 3.1.

Python 3.0 supported the with statement, but each statement can only use one context

manager. Python 3.1 allows you to chain multiple context managers in a single with

statement.

Let’s take the last part first.

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')

That’s a complicated with statement. Let me rewrite it as something more recognizable.

with open('out.log', mode='w', encoding='utf-8') as a_file:
    with RedirectStdoutTo(a_file):
        print('B')

As the rewrite shows, you actually have two with statements, one nested within the scope of the other. The

“outer” with statement should be familiar by now: it opens a UTF-8-encoded text file named out.log for

writing and assigns the stream object to a variable named a_file. But that’s not the only thing odd here.

with RedirectStdoutTo(a_file):

Where’s the as clause? The with statement doesn’t actually require one. Just like you can call a function and

ignore its return value, you can have a with statement that doesn’t assign the with context to a variable. In

this case, you’re only interested in the side effects of the RedirectStdoutTo context.

What are those side effects? Take a look inside the RedirectStdoutTo class. This class is a custom context

manager. Any class can be a context manager by defining two special methods: __enter__() and __exit__().

class RedirectStdoutTo:
    def __init__(self, out_new):
        self.out_new = out_new

    def __enter__(self):
        self.out_old = sys.stdout
        sys.stdout = self.out_new

    def __exit__(self, *args):
        sys.stdout = self.out_old

1. The __init__() method is called immediately after an instance is created. It takes one parameter, the

stream object that you want to use as standard output for the life of the context. This method just saves

the stream object in an instance variable so other methods can use it later.

2. The __enter__() method is a special class method; Python calls it when entering a context (i.e. at the beginning of the with statement). This method saves the current value of sys.stdout in self.out_old,

then redirects standard output by assigning self.out_new to sys.stdout.

3. The __exit__() method is another special class method; Python calls it when exiting the context (i.e. at the

end of the with statement). This method restores standard output to its original value by assigning the saved

self.out_old value to sys.stdout.

Putting it all together:

print('A')
with open('out.log', mode='w', encoding='utf-8') as a_file, RedirectStdoutTo(a_file):
    print('B')
print('C')

1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command

line).

2. This with statement takes a comma-separated list of contexts. The comma-separated list acts like a series of nested with blocks. The first context listed is the “outer” block; the last one listed is the “inner” block. The

first context opens a file; the second context redirects sys.stdout to the stream object that was created in

the first context.

3. Because this print() function is executed within the context created by the with statement, it will not print to the screen; it will write to the file out.log.

4. The with code block is over. Python has told each context manager to do whatever it is they do upon

exiting a context. The context managers form a last-in-first-out stack. Upon exiting, the second context

changed sys.stdout back to its original value, then the first context closed the file named out.log. Since

standard output has been restored to its original value, calling the print() function will once again print to

the screen.

Redirecting standard error works exactly the same way, using sys.stderr instead of sys.stdout.
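A sketch of that variation (the class name RedirectStderrTo is mine, built by direct analogy with the class above):

import sys

class RedirectStderrTo:
    def __init__(self, err_new):
        self.err_new = err_new

    def __enter__(self):
        self.err_old = sys.stderr
        sys.stderr = self.err_new

    def __exit__(self, *args):
        sys.stderr = self.err_old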

11.7. FURTHER READING

Reading and writing files in the Python.org tutorial

io module

Stream objects

Context manager types

sys.stdout and sys.stderr

FUSE on Wikipedia

CHAPTER 12. XML

In the archonship of Aristaechmus, Draco enacted his ordinances.

— Aristotle

12.1. DIVING IN

Nearly all the chapters in this book revolve around a piece of sample code. But XML isn’t about code;

it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum,

or other frequently-updated website. Most popular blogging software can produce a feed and update it

whenever new articles, discussion threads, or blog posts are published. You can follow a blog by

“subscribing” to its feed, and you can follow multiple blogs with a dedicated “feed aggregator” like Google

Reader.

Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an Atom

syndication feed.

<?xml version='1.0' encoding='utf-8'?>

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title>dive into mark</title>

<subtitle>currently between addictions</subtitle>

<id>tag:diveintomark.org,2001-07-29:/</id>

<updated>2009-03-27T21:56:07Z</updated>

<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>

<link rel='self' type='application/atom+xml' href='http://diveintomark.org/feed/'/>

<entry>

<author>

<name>Mark</name>

<uri>http://diveintomark.org/</uri>

</author>

<title>Dive into history, 2009 edition</title>

<link rel='alternate' type='text/html'

href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>

<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>

<updated>2009-03-27T21:56:07Z</updated>

<published>2009-03-27T17:20:42Z</published>

<category scheme='http://diveintomark.org' term='diveintopython'/>

<category scheme='http://diveintomark.org' term='docbook'/>

<category scheme='http://diveintomark.org' term='html'/>

<summary type='html'>Putting an entire chapter on one page sounds

bloated, but consider this &amp;mdash; my longest chapter so far

would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;

On dialup.</summary>

</entry>

<entry>

<author>

<name>Mark</name>

<uri>http://diveintomark.org/</uri>

</author>

<title>Accessibility is a harsh mistress</title>

<link rel='alternate' type='text/html'

href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>

<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>

<updated>2009-03-22T01:05:37Z</updated>

<published>2009-03-21T20:09:28Z</published>

<category scheme='http://diveintomark.org' term='accessibility'/>

<summary type='html'>The accessibility orthodoxy does not permit people to

question the value of features that are rarely useful and rarely used.</summary>

</entry>

<entry>

<author>

<name>Mark</name>

</author>

<title>A gentle introduction to video encoding, part 1: container formats</title>

<link rel='alternate' type='text/html'

href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>

<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>

<updated>2009-01-11T19:39:22Z</updated>

<published>2008-12-18T15:54:22Z</published>

<category scheme='http://diveintomark.org' term='asf'/>

<category scheme='http://diveintomark.org' term='avi'/>

<category scheme='http://diveintomark.org' term='encoding'/>

<category scheme='http://diveintomark.org' term='flv'/>

<category scheme='http://diveintomark.org' term='GIVE'/>

<category scheme='http://diveintomark.org' term='mp4'/>

<category scheme='http://diveintomark.org' term='ogg'/>

<category scheme='http://diveintomark.org' term='video'/>

<summary type='html'>These notes will eventually become part of a

tech talk on video encoding.</summary>

</entry>

</feed>

12.2. A 5-MINUTE CRASH COURSE IN XML

If you already know about XML, you can skip this section.

XML is a generalized way of describing hierarchical structured data. An XML document contains one or more

elements, which are delimited by start and end tags. This is a complete (albeit boring) XML document:

<foo>

</foo>

1. This is the start tag of the foo element.

2. This is the matching end tag of the foo element. Like balancing parentheses in writing or mathematics or

code, every start tag must be closed (matched) by a corresponding end tag.

Elements can be nested to any depth. An element bar inside an element foo is said to be a subelement or

child of foo.

<foo>

<bar></bar>

</foo>

The first element in every XML document is called the root element. An XML document can only have one

root element. The following is not an XML document, because it has two root elements:

<foo></foo>

<bar></bar>

Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an

element and separated by whitespace. Attribute names can not be repeated within an element. Attribute values

must be quoted. You may use either single or double quotes.

<foo lang='en'>
  <bar id='papayawhip' lang="fr"></bar>
</foo>

1. The foo element has one attribute, named lang. The value of its lang attribute is en.

2. The bar element has two attributes, named id and lang. The value of its lang attribute is fr. This doesn’t

conflict with the foo element in any way. Each element has its own set of attributes.

If an element has more than one attribute, the ordering of the attributes is not significant. An element’s

attributes form an unordered set of keys and values, like a Python dictionary. There is no limit to the

number of attributes you can define on each element.

Elements can have text content.

<foo lang='en'>

<bar lang='fr'>PapayaWhip</bar>

</foo>

Elements that contain no text and no children are empty.

<foo></foo>

There is a shorthand for writing empty elements. By putting a / character in the start tag, you can skip the

end tag altogether. The XML document in the previous example could be written like this instead:

<foo/>

Just as Python functions can be declared in different modules, XML elements can be declared in different

namespaces. Namespaces usually look like URLs. You use an xmlns declaration to define a default namespace.

A namespace declaration looks similar to an attribute, but it has a different purpose.

<feed xmlns='http://www.w3.org/2005/Atom'>

<title>dive into mark</title>

</feed>

1. The feed element is in the http://www.w3.org/2005/Atom namespace.

2. The title element is also in the http://www.w3.org/2005/Atom namespace. The namespace declaration

affects the element where it’s declared, plus all child elements.

You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Then

each element in that namespace must be explicitly declared with the prefix.

<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>

<atom:title>dive into mark</atom:title>

</atom:feed>

1. The feed element is in the http://www.w3.org/2005/Atom namespace.

2. The title element is also in the http://www.w3.org/2005/Atom namespace.

As far as an XML parser is concerned, the previous two XML documents are identical. Namespace + element

name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (atom:) is

irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and

each element’s text content matches, therefore the XML documents are the same.
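You can check this claim with the ElementTree library introduced later in this chapter (the two one-line documents below are my reductions of the examples above):

>>> import xml.etree.ElementTree as etree
>>> d1 = etree.fromstring(
...     "<feed xmlns='http://www.w3.org/2005/Atom'><title>dive into mark</title></feed>")
>>> d2 = etree.fromstring(
...     "<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'><atom:title>dive into mark</atom:title></atom:feed>")
>>> d1.tag == d2.tag
True
>>> d1[0].tag == d2[0].tag
True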

Finally, XML documents can contain character encoding information on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the

document can be parsed, Section F of the XML specification details how to resolve this Catch-22.)

<?xml version='1.0' encoding='utf-8'?>

And now you know just enough XML to be dangerous!

12.3. THE STRUCTURE OF AN ATOM FEED

Think of a weblog, or in fact any website with frequently updated content, like CNN.com. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment & Video News”), a

last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different

times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published

a correction or fixed a typo), and a unique URL.

The Atom syndication format is designed to capture all of this information in a standard format. My weblog

and CNN.com are wildly different in design, scope, and audience, but they both have the same basic

structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles.

At the top level is the root element, which every Atom feed shares: the feed element in the

http://www.w3.org/2005/Atom namespace.

<feed xmlns='http://www.w3.org/2005/Atom'

xml:lang='en'>

1. http://www.w3.org/2005/Atom is the Atom namespace.

2. Any element can contain an xml:lang attribute, which declares the language of the element and its children.

In this case, the xml:lang attribute is declared once on the root element, which means the entire feed is in

English.

An Atom feed contains several pieces of information about the feed itself. These are declared as children of

the root-level feed element.

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title>dive into mark</title>

<subtitle>currently between addictions</subtitle>

<id>tag:diveintomark.org,2001-07-29:/</id>

<updated>2009-03-27T21:56:07Z</updated>

<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>

1. The title of this feed is dive into mark.

2. The subtitle of this feed is currently between addictions.

3. Every feed needs a globally unique identifier. See RFC 4151 for how to create one.

4. This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified

date of the most recent article.

5. Now things start to get interesting. This link element has no text content, but it has three attributes: rel,

type, and href. The rel value tells you what kind of link this is; rel='alternate' means that this is a link

to an alternate representation of this feed. The type='text/html' attribute means that this is a link to an

HTML page. And the link target is given in the href attribute.

Now we know that this is a feed for a site named “dive into mark” which is available at

http://diveintomark.org/ and was last updated on March 27, 2009.

☞ Although the order of elements can be relevant in some XML documents, it is not

relevant in an Atom feed.

After the feed-level metadata is the list of the most recent articles. An article looks like this:

<entry>

<author>

<name>Mark</name>

<uri>http://diveintomark.org/</uri>

</author>

<title>Dive into history, 2009 edition</title>

<link rel='alternate' type='text/html'

href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>

<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>

<updated>2009-03-27T21:56:07Z</updated>

<published>2009-03-27T17:20:42Z</published>

<category scheme='http://diveintomark.org' term='diveintopython'/>

<category scheme='http://diveintomark.org' term='docbook'/>

<category scheme='http://diveintomark.org' term='html'/>

<summary type='html'>Putting an entire chapter on one page sounds

bloated, but consider this &amp;mdash; my longest chapter so far

would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;

On dialup.</summary>

</entry>

1. The author element tells who wrote this article: some guy named Mark, whom you can find loafing at

http://diveintomark.org/. (This is the same as the alternate link in the feed metadata, but it doesn’t have

to be. Many weblogs have multiple authors, each with their own personal website.)

2. The title element gives the title of the article, “Dive into history, 2009 edition”.

3. As with the feed-level alternate link, this link element gives the address of the HTML version of this article.

4. Entries, like feeds, need a unique identifier.

5. Entries have two dates: a first-published date (published) and a last-modified date (updated).

6. Entries can have an arbitrary number of categories. This article is filed under diveintopython, docbook, and

html.

7. The summary element gives a brief summary of the article. (There is also a content element, not shown

here, if you want to include the complete article text in your feed.) This summary element has the Atom-

specific type='html' attribute, which specifies that this summary is a snippet of HTML, not plain text. This is

important, since it has HTML-specific entities in it (&mdash; and &hellip;) which should be rendered as

“—” and “…” rather than displayed directly.

8. Finally, the end tag for the entry element, signaling the end of the metadata for this article.

12.4. PARSING XML

Python can parse XML documents in several ways. It has traditional DOM and SAX parsers, but I will focus on a different library called ElementTree.

>>> import xml.etree.ElementTree as etree

>>> tree = etree.parse('examples/feed.xml')

>>> root = tree.getroot()

>>> root

<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

1. The ElementTree library is part of the Python standard library, in xml.etree.ElementTree.

2. The primary entry point for the ElementTree library is the parse() function, which can take a filename or a

file-like object. This function parses the entire document at once. If memory is tight, there are ways to parse

an XML document incrementally instead (one such way is sketched after this list).

3. The parse() function returns an object which represents the entire document. This is not the root element.

To get a reference to the root element, call the getroot() method.

4. As expected, the root element is the feed element in the http://www.w3.org/2005/Atom namespace. The

string representation of this object reinforces an important point: an XML element is a combination of its

namespace and its tag name (also called the local name). Every element in this document is in the Atom

namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.
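Here is the incremental approach promised in item 2 (a sketch using the standard library’s iterparse() function; filtering on title elements is my choice, not the book’s example):

import xml.etree.ElementTree as etree

# iterparse() yields (event, element) pairs as the parser reads the file,
# instead of building the entire tree in memory first.
for event, element in etree.iterparse('examples/feed.xml'):
    if element.tag == '{http://www.w3.org/2005/Atom}title':
        print(element.text)
    element.clear()   # discard each element once we're done with it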

☞ ElementTree represents XML elements as {namespace}localname. You’ll see and use

this format in multiple places in the ElementTree API.

12.4.1. ELEMENTS ARE LISTS

In the ElementTree API, an element acts like a list. The items of the list are the element’s children.

# continued from the previous example

>>> root.tag

'{http://www.w3.org/2005/Atom}feed'

>>> len(root)

8

>>> for child in root:
...     print(child)
...

<Element {http://www.w3.org/2005/Atom}title at e2b5d0>

<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>

<Element {http://www.w3.org/2005/Atom}id at e2b6c0>

<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>

<Element {http://www.w3.org/2005/Atom}link at e2b4b0>

<Element {http://www.w3.org/2005/Atom}entry at e2b720>

<Element {http://www.w3.org/2005/Atom}entry at e2b510>

<Element {http://www.w3.org/2005/Atom}entry at e2b750>

1. Continuing from the previous example, the root element is {http://www.w3.org/2005/Atom}feed.

2. The “length” of the root element is the number of child elements.

3. You can use the element itself as an iterator to loop through all of its child elements.

4. As you can see from the output, there are indeed 8 child elements: all of the feed-level metadata (title,

subtitle, id, updated, and link) followed by the three entry elements.

You may have guessed this already, but I want to point it out explicitly: the list of child elements only

includes direct children. Each of the entry elements contains its own children, but those are not included in

the list. They would be included in the list of each entry’s children, but they are not included in the list of

the feed’s children. There are ways to find elements no matter how deeply nested they are; we’ll look at

two such ways later in this chapter.

12.4.2. ATTRIBUTES ARE DICTIONARIES

XML isn’t just a collection of elements; each element can also have its own set of attributes. Once you have

a reference to a specific element, you can easily get its attributes as a Python dictionary.

# continuing from the previous example

>>> root.attrib

{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

>>> root[4]

<Element {http://www.w3.org/2005/Atom}link at e181b0>

>>> root[4].attrib

{'href': 'http://diveintomark.org/',

'type': 'text/html',

'rel': 'alternate'}

>>> root[3]

<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>

>>> root[3].attrib

{}

1. The attrib property is a dictionary of the element’s attributes. The original markup here was <feed

xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>. The xml: prefix refers to a built-in namespace

that every XML document can use without declaring it.

2. The fifth child — [4] in a 0-based list — is the link element.

3. The link element has three attributes: href, type, and rel.

4. The fourth child — [3] in a 0-based list — is the updated element.

5. The updated element has no attributes, so its .attrib is just an empty dictionary.

12.5. SEARCHING FOR NODES WITHIN AN XML DOCUMENT

So far, we’ve worked with this XML document “from the top down,” starting with the root element, getting

its child elements, and so on throughout the document. But many uses of XML require you to find specific

elements. Etree can do that, too.

>>> import xml.etree.ElementTree as etree

>>> tree = etree.parse('examples/feed.xml')

>>> root = tree.getroot()

>>> root.findall('{http://www.w3.org/2005/Atom}entry')

[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,

<Element {http://www.w3.org/2005/Atom}entry at e2b510>,

<Element {http://www.w3.org/2005/Atom}entry at e2b540>]

>>> root.tag

'{http://www.w3.org/2005/Atom}feed'

>>> root.findall('{http://www.w3.org/2005/Atom}feed')

[]

>>> root.findall('{http://www.w3.org/2005/Atom}author')

[]

1. The findall() method finds child elements that match a specific query. (More on the query format in a

minute.)

2. Each element — including the root element, but also child elements — has a findall() method. It finds all

matching elements among the element’s children. But why aren’t there any results? Although it may not be

obvious, this particular query only searches the element’s children. Since the root feed element has no child

named feed, this query returns an empty list.

3. This result may also surprise you. There is an author element in this document; in fact, there are three (one in each entry). But those author elements are not direct children of the root element; they are

“grandchildren” (literally, a child element of a child element). If you want to look for author elements at any

nesting level, you can do that, but the query format is slightly different.

>>> tree.findall('{http://www.w3.org/2005/Atom}entry')

[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,

<Element {http://www.w3.org/2005/Atom}entry at e2b510>,

<Element {http://www.w3.org/2005/Atom}entry at e2b540>]

>>> tree.findall('{http://www.w3.org/2005/Atom}author')

[]

1. For convenience, the tree object (returned from the etree.parse() function) has several methods that

mirror the methods on the root element. The results are the same as if you had called the

tree.getroot().findall() method.

2. Perhaps surprisingly, this query does not find the author elements in this document. Why not? Because this

is just a shortcut for tree.getroot().findall('{http://www.w3.org/2005/Atom}author'), which means

“find all the author elements that are children of the root element.” The author elements are not children

of the root element; they’re children of the entry elements. Thus the query doesn’t return any matches.

There is also a find() method which returns the first matching element. This is useful for situations where

you are only expecting one match, or if there are multiple matches, you only care about the first one.

>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')

>>> len(entries)

3

>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title')

>>> title_element.text

'Dive into history, 2009 edition'

>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')

>>> foo_element

>>> type(foo_element)

<class 'NoneType'>

1. You saw this in the previous example. It finds all the atom:entry elements.

2. The find() method takes an ElementTree query and returns the first matching element.

3. There are no elements in this entry named foo, so this returns None.

☞ There is a “gotcha” with the find() method that will eventually bite you. In a boolean context, ElementTree element objects evaluate to False if they contain no children (i.e. if len(element) is 0). This means that writing if element.find('...'): does not test whether the find() method found a matching element; it tests whether the element it found has any child elements! To test whether the find() method returned an element, use if element.find('...') is not None.
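
Here is a minimal sketch of the gotcha in action, reusing examples/feed.xml from this chapter. The title element has text content but no child elements, so it is falsy even though find() clearly found it:

import xml.etree.ElementTree as etree

tree = etree.parse('examples/feed.xml')
entry = tree.findall('{http://www.w3.org/2005/Atom}entry')[0]
title = entry.find('{http://www.w3.org/2005/Atom}title')

print(title is None)    # False: find() did return an element
print(bool(title))      # also False! the element has no children, so it is falsy

if title is not None:   # the correct test
    print(title.text)   # 'Dive into history, 2009 edition'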

There is a way to search for descendant elements, i.e. children, grandchildren, and any element at any nesting level.


>>> all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')

>>> all_links

[<Element {http://www.w3.org/2005/Atom}link at e181b0>,

<Element {http://www.w3.org/2005/Atom}link at e2b570>,

<Element {http://www.w3.org/2005/Atom}link at e2b480>,

<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]

>>> all_links[0].attrib

{'href': 'http://diveintomark.org/',

'type': 'text/html',

'rel': 'alternate'}

>>> all_links[1].attrib

{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',

'type': 'text/html',

'rel': 'alternate'}

>>> all_links[2].attrib

{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',

'type': 'text/html',

'rel': 'alternate'}

>>> all_links[3].attrib

{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',

'type': 'text/html',

'rel': 'alternate'}

1. This query — //{http://www.w3.org/2005/Atom}link — is very similar to the previous examples, except

for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct

children; I want any elements, regardless of nesting level.” So the result is a list of four link elements, not

just one.

2. The first result is a direct child of the root element. As you can see from its attributes, this is the feed-level

alternate link that points to the HTML version of the website that the feed describes.

3. The other three results are each entry-level alternate links. Each entry has a single link child element, and

because of the double slash at the beginning of the query, this query finds all of them.
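
(One forward-compatibility note: newer versions of the built-in ElementTree warn about queries that begin with two slashes. The future-proof spelling anchors the search at the current node with a leading dot, and returns the same four link elements.)

all_links = tree.findall('.//{http://www.w3.org/2005/Atom}link')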

Overall, ElementTree’s findall() method is a very powerful feature, but the query language can be a bit

surprising. It is officially described as “limited support for XPath expressions.” XPath is a W3C standard for

querying XML documents. ElementTree’s query language is similar enough to XPath to do basic searching,

but dissimilar enough that it may annoy you if you already know XPath. Now let’s look at a third-party XML

library that extends the ElementTree API with full XPath support.

12.6. GOING FURTHER WITH LXML

lxml is an open source third-party library that builds on the popular libxml2 parser. It provides a 100%

compatible ElementTree API, then extends it with full XPath 1.0 support and a few other niceties. There are

installers available for Windows; Linux users should always try to use distribution-specific tools like yum or apt-get to install precompiled binaries from their repositories. Otherwise you’ll need to install lxml

manually.
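
These days, assuming the pip packaging tool is available on your system, a plain pip install will usually fetch a precompiled build as well:

you@localhost:~$ pip install lxml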

>>> from lxml import etree

>>> tree = etree.parse('examples/feed.xml')

>>> root = tree.getroot()

>>> root.findall('{http://www.w3.org/2005/Atom}entry')

[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,

<Element {http://www.w3.org/2005/Atom}entry at e2b510>,

<Element {http://www.w3.org/2005/Atom}entry at e2b540>]

1. Once imported, lxml provides the same API as the built-in ElementTree library.

2. parse() function: same as ElementTree.

3. getroot() method: also the same.

4. findall() method: exactly the same.

For large XML documents, lxml is significantly faster than the built-in ElementTree library. If you’re only

using the ElementTree API and want to use the fastest available implementation, you can try to import lxml

and fall back to the built-in ElementTree.


try:

from lxml import etree

except ImportError:

import xml.etree.ElementTree as etree

But lxml is more than just a faster ElementTree. Its findall() method includes support for more

complicated expressions.

>>> import lxml.etree

>>> tree = lxml.etree.parse('examples/feed.xml')

>>> tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')

[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,

<Element {http://www.w3.org/2005/Atom}link at eeb990>,

<Element {http://www.w3.org/2005/Atom}link at eeb960>,

<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]

>>> tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")

[<Element {http://www.w3.org/2005/Atom}link at eeb930>]

>>> NS = '{http://www.w3.org/2005/Atom}'

>>> tree.findall('//{NS}author[{NS}uri]'.format(NS=NS))

[<Element {http://www.w3.org/2005/Atom}author at eeba80>,

<Element {http://www.w3.org/2005/Atom}author at eebba0>]

1. In this example, I’m going to import lxml.etree (instead of, say, from lxml import etree), to emphasize

that these features are specific to lxml.

2. This query finds all elements in the Atom namespace, anywhere in the document, that have an href

attribute. The // at the beginning of the query means “elements anywhere (not just as children of the root

element).” {http://www.w3.org/2005/Atom} means “only elements in the Atom namespace.” * means

“elements with any local name.” And [@href] means “has an href attribute.”

3. The query finds all Atom elements with an href whose value is http://diveintomark.org/.

4. After doing some quick string formatting (because otherwise these compound queries get ridiculously long), this query searches for Atom author elements that have an Atom uri element as a child. This only returns

two author elements, the ones in the first and second entry. The author in the last entry contains only a

name, not a uri.
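
If you use that string-formatting trick often, you might wrap it in a tiny helper. This is just a sketch; the function name q is made up for this example and is not part of any library:

NS = '{http://www.w3.org/2005/Atom}'

def q(query, ns=NS):
    '''Expand {NS} placeholders in an ElementTree-style query.'''
    return query.format(NS=ns)

tree.findall(q('//{NS}author[{NS}uri]'))   # same two author elements as above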


Not enough for you? lxml also integrates support for arbitrary XPath 1.0 expressions. I’m not going to go

into depth about XPath syntax; that could be a whole book unto itself! But I will show you how it integrates

into lxml.

>>> import lxml.etree

>>> tree = lxml.etree.parse('examples/feed.xml')

>>> NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}

>>> entries = tree.xpath("//atom:category[@term='accessibility']/..",
...                      namespaces=NSMAP)

>>> entries

[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]

>>> entry = entries[0]

>>> entry.xpath('./atom:title/text()', namespaces=NSMAP)

['Accessibility is a harsh mistress']

1. To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is

just a Python dictionary.

2. Here is an XPath query. The XPath expression searches for category elements (in the Atom namespace)

that contain a term attribute with the value accessibility. But that’s not actually the query result. Look at

the very end of the query string; did you notice the /.. bit? That means “and then return the parent

element of the category element you just found.” So this single XPath query will find all entries with a child

element of <category term='accessibility'>.

3. The xpath() function returns a list of ElementTree objects. In this document, there is only one entry with a

category whose term is accessibility.

4. XPath expressions don’t always return a list of elements. Technically, the DOM of a parsed XML document

doesn’t contain elements; it contains nodes. Depending on their type, nodes can be elements, attributes, or

even text content. The result of an XPath query is a list of nodes. This query returns a list of text nodes:

the text content (text()) of the title element (atom:title) that is a child of the current element (./).
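
To drive that point home, here is a sketch of two more queries against the same document: one returning attribute values as plain strings, and one returning a number. Both use only standard XPath 1.0 features:

import lxml.etree

tree = lxml.etree.parse('examples/feed.xml')
NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}

hrefs = tree.xpath('//atom:link/@href', namespaces=NSMAP)
# a list of four plain strings, the same href values shown earlier

count = tree.xpath('count(//atom:entry)', namespaces=NSMAP)
# a float: 3.0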


12.7. GENERATING XML

Python’s support for XML is not limited to parsing existing documents. You can also create XML documents

from scratch.

>>> import xml.etree.ElementTree as etree

>>> new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',
...     attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})

>>> print(etree.tostring(new_feed))

<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>

1. To create a new element, instantiate the Element class. You pass the element name (namespace + local

name) as the first argument. This statement creates a feed element in the Atom namespace. This will be our

new document’s root element.

2. To add attributes to the newly created element, pass a dictionary of attribute names and values in the

attrib argument. Note that the attribute name should be in the standard ElementTree format,

{namespace}localname.

3. At any time, you can serialize any element (and its children) with the ElementTree tostring() function.

Was that serialization surprising to you? The way ElementTree serializes namespaced XML elements is

technically accurate but not optimal. The sample XML document at the beginning of this chapter defined a

default namespace (xmlns='http://www.w3.org/2005/Atom'). Defining a default namespace is useful for

documents — like Atom feeds — where every element is in the same namespace, because you can declare

the namespace once and declare each element with just its local name (<feed>, <link>, <entry>). There is

no need to use any prefixes unless you want to declare elements from another namespace.

An XML parser won’t “see” any difference between an XML document with a default namespace and an XML

document with a prefixed namespace. The resulting DOM of this serialization:

<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>

is identical to the DOM of this serialization:

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>


The only practical difference is that the second serialization is several characters shorter. If we were to

recast our entire sample feed with a ns0: prefix in every start and end tag, it would add 4 characters per

start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters.

Assuming UTF-8 encoding, that’s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be

downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.

The built-in ElementTree library does not offer this fine-grained control over serializing namespaced

elements, but lxml does.

>>> import lxml.etree

>>> NSMAP = {None: 'http://www.w3.org/2005/Atom'}

>>> new_feed = lxml.etree.Element('feed', nsmap=NSMAP)

>>> print(lxml.etree.tounicode(new_feed))

<feed xmlns='http://www.w3.org/2005/Atom'/>

>>> new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')

>>> print(lxml.etree.tounicode(new_feed))

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>

1. To start, define a namespace mapping as a dictionary. Dictionary values are namespaces; dictionary keys are

the desired prefix. Using None as a prefix effectively declares a default namespace.

2. Now you can pass the lxml-specific nsmap argument when you create an element, and lxml will respect the

namespace prefixes you’ve defined.

3. As expected, this serialization defines the Atom namespace as the default namespace and declares the feed

element without a namespace prefix.

4. Oops, we forgot to add the xml:lang attribute. You can always add attributes to any element with the

set() method. It takes two arguments: the attribute name in standard ElementTree format, then the

attribute value. (This method is not lxml-specific. The only lxml-specific part of this example was the nsmap

argument to control the namespace prefixes in the serialized output.)
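
For comparison, here is a sketch of the same element created with an explicit prefix in the namespace mapping instead of None; lxml then serializes every element in that namespace with the prefix:

import lxml.etree

NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
feed = lxml.etree.Element('{http://www.w3.org/2005/Atom}feed', nsmap=NSMAP)
print(lxml.etree.tounicode(feed))
# <atom:feed xmlns:atom="http://www.w3.org/2005/Atom"/>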

Are XML documents limited to one element per document? No, of course not. You can easily create child

elements, too.


>>> title = lxml.etree.SubElement(new_feed, 'title',
...     attrib={'type':'html'})

>>> print(lxml.etree.tounicode(new_feed))

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'/></feed>

>>> title.text = 'dive into &hellip;'

>>> print(lxml.etree.tounicode(new_feed))

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'>dive into &amp;hellip;</title></feed>

>>> print(lxml.etree.tounicode(new_feed, pretty_print=True))

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title type='html'>dive into &amp;hellip;</title>

</feed>

1. To create a child element of an existing element, instantiate the SubElement class. The only required

arguments are the parent element (new_feed in this case) and the new element’s name. Since this child

element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or

prefix here.

2. You can also pass in an attribute dictionary. Keys are attribute names; values are attribute values.

3. As expected, the new title element was created in the Atom namespace, and it was inserted as a child of

the feed element. Since the title element has no text content and no children of its own, lxml serializes it

as an empty element (with the /> shortcut).

4. To set the text content of an element, simply set its .text property.

5. Now the title element is serialized with its text content. Any text content that contains less-than signs or

ampersands needs to be escaped when serialized. lxml handles this escaping automatically.

6. You can also apply “pretty printing” to the serialization, which inserts line breaks after end tags, and after

start tags of elements that contain child elements but no text content. In technical terms, lxml adds

“insignificant whitespace” to make the output more readable.

☞ You might also want to check out xmlwitch, another third-party library for generating XML. It makes extensive use of the with statement to make XML generation code more readable.


12.8. PARSING BROKEN XML

The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is,

they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document.

Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters,

and a number of other esoteric rules. This is in stark contrast to other common formats like HTML — your

browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in

an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error

handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)

Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian

error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But

in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like

Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which

standardized on draconian error handling in 1997, surveys continually show that a significant fraction of Atom feeds on the web are plagued with wellformedness errors.

So, I have both theoretical and practical reasons to parse XML documents “at any cost,” that is, not to halt

and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.

Here is a fragment of a broken XML document. I’ve highlighted the wellformedness error.

<?xml version='1.0' encoding='utf-8'?>

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title>dive into &hellip;</title>

...

</feed>

That’s an error, because the &hellip; entity is not defined in XML. (It is defined in HTML.) If you try to

parse this broken feed with the default settings, lxml will choke on the undefined entity.


>>> import lxml.etree

>>> tree = lxml.etree.parse('examples/feed-broken.xml')

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "lxml.etree.pyx", line 2693, in lxml.etree.parse (src/lxml/lxml.etree.c:52591)

File "parser.pxi", line 1478, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:75665)

File "parser.pxi", line 1507, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:75993)

File "parser.pxi", line 1407, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:75002)

File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:72023)

File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:67830)

File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:68877)

File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:68125)

lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28

To parse this broken XML document, despite its wellformedness error, you need to create a custom XML

parser.

>>> parser = lxml.etree.XMLParser(recover=True)

>>> tree = lxml.etree.parse('examples/feed-broken.xml', parser)

>>> parser.error_log

examples/feed-broken.xml:3:28:FATAL:PARSER:ERR_UNDECLARED_ENTITY: Entity 'hellip' not defined

>>> tree.findall('{http://www.w3.org/2005/Atom}title')

[<Element {http://www.w3.org/2005/Atom}title at ead510>]

>>> title = tree.findall('{http://www.w3.org/2005/Atom}title')[0]

>>> title.text

'dive into '

>>> print(lxml.etree.tounicode(tree.getroot()))

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title>dive into </title>

.

. [rest of serialization snipped for brevity]

.


1. To create a custom parser, instantiate the lxml.etree.XMLParser class. It can take a number of different

named arguments. The one we’re interested in here is the recover argument. When set to True, the XML

parser will try its best to “recover” from wellformedness errors.

2. To parse an XML document with your custom parser, pass the parser object as the second argument to

the parse() function. Note that lxml does not raise an exception about the undefined &hellip; entity.

3. The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless

of whether it is set to recover from those errors or not.)

4. Since it didn’t know what to do with the undefined &hellip; entity, the parser just silently dropped it. The

text content of the title element becomes 'dive into '.

5. As you can see from the serialization, the &hellip; entity didn't get escaped or preserved in any form; it was simply dropped.

It is important to reiterate that there is no guarantee of interoperability with “recovering” XML parsers.

A different parser might decide that it recognized the &hellip; entity from HTML, and replace it with

&amp;hellip; instead. Is that “better”? Maybe. Is it “more correct”? No, they are both equally incorrect.

The correct behavior (according to the XML specification) is to halt and catch fire. If you’ve decided not to

do that, you’re on your own.

12.9. FURTHER READING

XML on Wikipedia.org

The ElementTree XML API

Elements and Element Trees

XPath Support in ElementTree

The ElementTree iterparse Function

lxml

Parsing XML and HTML with lxml

XPath and XSLT with lxml

xmlwitch


CHAPTER 13. SERIALIZING PYTHON OBJECTS

Every Saturday since we’ve lived in this apartment, I have awakened at 6:15, poured myself a bowl of cereal,

added

a quarter-cup of 2% milk, sat on this end of this couch, turned on BBC America, and watched Doctor Who.

— Sheldon, The Big Bang Theory

13.1. DIVING IN

On the surface, the concept of serialization is simple. You have a data structure in memory that you

want to save, reuse, or send to someone else. How would you do that? Well, that depends on how you

want to save it, how you want to reuse it, and to whom you want to send it. Many games allow you to save

your progress when you quit the game and pick up where you left off when you relaunch the game.

(Actually, many non-gaming applications do this as well.) In this case, a data structure that captures “your

progress so far” needs to be stored on disk when you quit, then loaded from disk when you relaunch. The

data is only meant to be used by the same program that created it, never sent over a network, and never

read by anything other than the program that created it. Therefore, the interoperability issues are limited to

ensuring that later versions of the program can read data written by earlier versions.

For cases like this, the pickle module is ideal. It’s part of the Python standard library, so it’s always

available. It’s fast; the bulk of it is written in C, like the Python interpreter itself. It can store arbitrarily

complex Python data structures.

What can the pickle module store?

• All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.

• Lists, tuples, dictionaries, and sets containing any combination of native datatypes.

• Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing

any combination of native datatypes (and so on, to the maximum nesting level that Python supports).


• Functions, classes, and instances of classes (with caveats).

If this isn’t enough for you, the pickle module is also extensible. If you’re interested in extensibility, check

out the links in the Further Reading section at the end of the chapter.
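
To make those caveats concrete, here is a quick sketch of something the pickle module refuses to serialize: an object tied to external state, such as an open file. (The exact wording of the error message varies between Python versions.)

import pickle

with open('examples/feed.xml', 'rb') as f:
    try:
        pickle.dumps(f)
    except TypeError as e:
        print(e)   # e.g. "cannot pickle '_io.BufferedReader' object"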

13.1.1. A QUICK NOTE ABOUT THE EXAMPLES IN THIS CHAPTER

This chapter tells a tale with two Python Shells. All of the examples in this chapter are part of a single story

arc. You will be asked to switch back and forth between the two Python Shells as I demonstrate the pickle

and json modules.

To help keep things straight, open the Python Shell and define the following variable:

>>> shell = 1

Keep that window open. Now open another Python Shell and define the following variable:

>>> shell = 2

Throughout this chapter, I will use the shell variable to indicate which Python Shell is being used in each

example.

13.2. SAVING DATA TO A PICKLE FILE

The pickle module works with data structures. Let’s build one.


>>> shell

1

>>> entry = {}

>>> entry['title'] = 'Dive into history, 2009 edition'

>>> entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'

>>> entry['comments_link'] = None

>>> entry['internal_id'] = b'\xDE\xD5\xB4\xF8'

>>> entry['tags'] = ('diveintopython', 'docbook', 'html')

>>> entry['published'] = True

>>> import time

>>> entry['published_date'] = time.strptime('Fri Mar 27 22:20:42 2009')

>>> entry['published_date']

time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1)

1. Follow along in Python Shell #1.

2. The idea here is to build a Python dictionary that could represent something useful, like an entry in an Atom

feed. But I also want to ensure that it contains several different types of data, to show off the pickle module. Don’t read too much into these values.

3. The time module contains a data structure (struct_time) to represent a point in time (accurate to one second) and functions to manipulate time structs. The strptime() function takes a formatted string and converts it to a struct_time. This string is in the default format, but you can control that with format codes. See the time module for more details.
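
For instance, here is a small sketch with an explicit format string; the format codes (%Y, %m, and so on) are the standard ones documented in the time module:

import time

t = time.strptime('2009-03-27 22:20:42', '%Y-%m-%d %H:%M:%S')
print(t.tm_year, t.tm_mon, t.tm_mday)   # 2009 3 27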

That’s a handsome-looking Python dictionary. Let’s save it to a file.

>>> shell

1

>>> import pickle

>>> with open('entry.pickle', 'wb') as f:
...     pickle.dump(entry, f)
...

1. This is still in Python Shell #1.


2. Use the open() function to open a file. Set the file mode to 'wb' to open the file for writing in binary

mode. Wrap it in a with statement to ensure the file is closed automatically when you’re done with it.

3. The dump() function in the pickle module takes a serializable Python data structure, serializes it into a

binary, Python-specific format using the latest version of the pickle protocol, and saves it to an open file.

That last sentence was pretty important.

• The pickle module takes a Python data structure and saves it to a file.

• To do this, it serializes the data structure using a data format called “the pickle protocol.”

• The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility. You probably

couldn’t take the entry.pickle file you just created and do anything useful with it in Perl, PHP, Java, or any

other language.

• Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed

several times as new data types have been added to the Python language, but there are still limitations.

• As a result of these changes, there is no guarantee of compatibility between different versions of Python

itself. Newer versions of Python support the older serialization formats, but older versions of Python do not

support newer formats (since they don’t support the newer data types).

• Unless you specify otherwise, the functions in the pickle module will use the latest version of the pickle

protocol. This ensures that you have maximum flexibility in the types of data you can serialize, but it also

means that the resulting file will not be readable by older versions of Python that do not support the latest

version of the pickle protocol.

• The latest version of the pickle protocol is a binary format. Be sure to open your pickle files in binary mode; if you open them in text mode, writing will fail (in Python 3 you'll get a TypeError, because the serialized data is bytes, not str).
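
Here is what “specifying otherwise” looks like, as a minimal sketch reusing the entry dictionary from this section (the filename entry-v2.pickle is made up). The optional protocol argument to pickle.dump() pins an older format, trading newer data-type support for readability by older Pythons:

import pickle

print(pickle.HIGHEST_PROTOCOL)           # the newest protocol this Python supports

with open('entry-v2.pickle', 'wb') as f:
    pickle.dump(entry, f, protocol=2)    # force the Python 2.3-era binary format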

13.3. LOADING DATA FROM A PICKLE FILE

Now switch to your second Python Shell — i.e. not the one where you created the entry dictionary.


>>> shell

2

>>> entry

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

NameError: name 'entry' is not defined

>>> import pickle

>>> with open('entry.pickle', 'rb') as f:
...     entry = pickle.load(f)
...

>>> entry

{'comments_link': None,

'internal_id': b'\xDE\xD5\xB4\xF8',

'title': 'Dive into history, 2009 edition',

'tags': ('diveintopython', 'docbook', 'html'),

'article_link':

'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',

'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),

'published': True}

1. This is Python Shell #2.

2. There is no entry variable defined here. You defined an entry variable in Python Shell #1, but that’s a

completely different environment with its own state.

3. Open the entry.pickle file you created in Python Shell #1. The pickle module uses a binary data format,

so you should always open pickle files in binary mode.

4. The pickle.load() function takes a stream object, reads the serialized data from the stream, recreates the original Python data structure as a brand-new object, and returns that new object.

5. Now the entry variable is a dictionary with familiar-looking keys and values.

The pickle.dump() / pickle.load() cycle results in a new data structure that is equal to the original data

structure.


>>> shell

1

>>> with open('entry.pickle', 'rb') as f:
...     entry2 = pickle.load(f)
...

>>> entry2 == entry

True

>>> entry2 is entry

False

>>> entry2['tags']

('diveintopython', 'docbook', 'html')

>>> entry2['internal_id']

b'\xDE\xD5\xB4\xF8'

1. Switch back to Python Shell #1.

2. Open the entry.pickle file.

3. Load the serialized data into a new variable, entry2.

4. Python confirms that the two dictionaries, entry and entry2, are equal. In this shell, you built entry from

the ground up, starting with an empty dictionary and manually assigning values to specific keys. You serialized

this dictionary and stored it in the entry.pickle file. Now you’ve read the serialized data from that file and

created a perfect replica of the original data structure.

5. Equality is not the same as identity. I said you’ve created a perfect replica of the original data structure, which is true. But it’s still a copy.

6. For reasons that will become clear later in this chapter, I want to point out that the value of the 'tags'

key is a tuple, and the value of the 'internal_id' key is a bytes object.

13.4. PICKLING WITHOUT A FILE

The examples in the previous section showed how to serialize a Python object directly to a file on disk. But

what if you don’t want or need a file? You can also serialize to a bytes object in memory.


>>> shell

1

>>> b = pickle.dumps(entry)

>>> type(b)

<class 'bytes'>

>>> entry3 = pickle.loads(b)

>>> entry3 == entry

True

1. The pickle.dumps() function (note the 's' at the end of the function name) performs the same

serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data

to a file on disk, it simply returns the serialized data.

2. Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.

3. The pickle.loads() function (again, note the 's' at the end of the function name) performs the same

deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized

data from a file, it takes a bytes object containing serialized data, such as the one returned by the

pickle.dumps() function.

4. The end result is the same: a perfect replica of the original dictionary.

13.5. BYTES AND STRINGS REAR THEIR UGLY HEADS AGAIN

The pickle protocol has been around for many years, and it has matured as Python itself has matured. There

are now four different versions of the pickle protocol.

• Python 1.x had two pickle protocols, a text-based format (“version 0”) and a binary format (“version 1”).

• Python 2.3 introduced a new pickle protocol (“version 2”) to handle new functionality in Python class

objects. It is a binary format.

• Python 3.0 introduced another pickle protocol (“version 3”) with explicit support for bytes objects and byte

arrays. It is a binary format.


Oh look, the difference between bytes and strings rears its ugly head again. (If you’re surprised, you haven’t been paying attention.) What this means in practice is that, while Python 3 can read data pickled with

protocol version 2, Python 2 cannot read data pickled with protocol version 3.

13.6. DEBUGGING PICKLE FILES

What does the pickle protocol look like? Let’s jump out of the Python Shell for a moment and take a look

at that entry.pickle file we created.

you@localhost:~/diveintopython3/examples$ ls -l entry.pickle

-rw-r--r-- 1 you you 358 Aug 3 13:34 entry.pickle

you@localhost:~/diveintopython3/examples$ cat entry.pickle

comments_linkqNXtagsqXdiveintopythonqXdocbookqXhtmlq?qX publishedq?

XlinkXJhttp://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition

q Xpublished_dateq

ctime

struct_time

?qRqXtitleqXDive into history, 2009 editionqu.

That wasn’t terribly helpful. You can see the strings, but other datatypes end up as unprintable (or at least

unreadable) characters. Fields are not obviously delimited by tabs or spaces. This is not a format you would

want to debug by yourself.


>>> shell

1

>>> import pickletools

>>> with open('entry.pickle', 'rb') as f:
...     pickletools.dis(f)

0: \x80 PROTO 3

2: } EMPTY_DICT

3: q BINPUT 0

5: ( MARK

6: X BINUNICODE 'published_date'

25: q BINPUT 1

27: c GLOBAL 'time struct_time'

45: q BINPUT 2

47: ( MARK

48: M BININT2 2009

51: K BININT1 3

53: K BININT1 27

55: K BININT1 22

57: K BININT1 20

59: K BININT1 42

61: K BININT1 4

63: K BININT1 86

65: J BININT -1

70: t TUPLE (MARK at 47)

71: q BINPUT 3

73: } EMPTY_DICT

74: q BINPUT 4

76: \x86 TUPLE2

77: q BINPUT 5

79: R REDUCE

80: q BINPUT 6

82: X BINUNICODE 'comments_link'

100: q BINPUT 7

102: N NONE