
103: X BINUNICODE 'internal_id'

119: q BINPUT 8


121: C SHORT_BINBYTES b'\xDE\xD5\xB4\xF8'

127: q BINPUT 9

129: X BINUNICODE 'tags'

138: q BINPUT 10

140: X BINUNICODE 'diveintopython'

159: q BINPUT 11

161: X BINUNICODE 'docbook'

173: q BINPUT 12

175: X BINUNICODE 'html'

184: q BINPUT 13

186: \x87 TUPLE3

187: q BINPUT 14

189: X BINUNICODE 'title'

199: q BINPUT 15

201: X BINUNICODE 'Dive into history, 2009 edition'

237: q BINPUT 16

239: X BINUNICODE 'article_link'

256: q BINPUT 17

258: X BINUNICODE 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'

337: q BINPUT 18

339: X BINUNICODE 'published'

353: q BINPUT 19

355: \x88 NEWTRUE

356: u SETITEMS (MARK at 5)

357: . STOP

highest protocol among opcodes = 3

The most interesting piece of information in that disassembly is on the last line, because it includes the

version of the pickle protocol with which this file was saved. There is no explicit version marker in the

pickle protocol. To determine which protocol version was used to store a pickle file, you need to look at

the markers (“opcodes”) within the pickled data and use hard-coded knowledge of which opcodes were

introduced with each version of the pickle protocol. The pickletools.dis() function does exactly that, and it


prints the result in the last line of the disassembly output. Here is a function that returns just the version

number, without printing anything:

import pickletools

def protocol_version(file_object):
    maxproto = -1
    for opcode, arg, pos in pickletools.genops(file_object):
        maxproto = max(maxproto, opcode.proto)
    return maxproto

And here it is in action:

>>> import pickleversion

>>> with open('entry.pickle', 'rb') as f:

...     v = pickleversion.protocol_version(f)

>>> v

3

13.7. SERIALIZING PYTHON OBJECTS TO BE READ BY OTHER LANGUAGES

The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with

other programming languages. If cross-language compatibility is one of your requirements, you need to look

at other serialization formats. One such format is JSON. “JSON” stands for “JavaScript Object Notation,” but don’t let the name fool you — JSON is explicitly designed to be usable across multiple programming

languages.

Python 3 includes a json module in the standard library. Like the pickle module, the json module has

functions for serializing data structures, storing the serialized data on disk, loading serialized data from disk,


and unserializing the data back into a new Python object. But there are some important differences, too.

First of all, the JSON data format is text-based, not binary. RFC 4627 defines the JSON format and how different types of data must be encoded as text. For example, a boolean value is stored as either the five-character string 'false' or the four-character string 'true'. All JSON values are case-sensitive.

Second, as with any text-based format, there is the issue of whitespace. JSON allows arbitrary amounts of

whitespace (spaces, tabs, carriage returns, and line feeds) between values. This whitespace is “insignificant,”

which means that JSON encoders can add as much or as little whitespace as they like, and JSON decoders

are required to ignore the whitespace between values. This allows you to “pretty-print” your JSON data,

nicely nesting values within values at different indentation levels so you can read it in a standard browser or

text editor. Python’s json module has options for pretty-printing during encoding.

Third, there’s the perennial problem of character encoding. JSON encodes values as plain text, but as you

know, there ain’t no such thing as “plain text.” JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8), and section 3 of RFC 4627 defines how to tell which encoding is being used.

13.8. SAVING DATA TO A JSON FILE

JSON looks remarkably like a data structure you might define manually in JavaScript. This is no accident; you

can actually use the JavaScript eval() function to “decode” JSON-serialized data. (The usual caveats about

untrusted input apply, but the point is that JSON is valid JavaScript.) As such, JSON may already look familiar to you.

# Python Shell #1

>>> basic_entry = {}

>>> basic_entry['id'] = 256

>>> basic_entry['title'] = 'Dive into history, 2009 edition'

>>> basic_entry['tags'] = ('diveintopython', 'docbook', 'html')

>>> basic_entry['published'] = True

>>> basic_entry['comments_link'] = None

>>> import json

>>> with open('basic.json', mode='w', encoding='utf-8') as f:

...     json.dump(basic_entry, f)

1. We’re going to create a new data structure instead of re-using the existing entry data structure. Later in

this chapter, we’ll see what happens when we try to encode the more complex data structure in JSON.

2. JSON is a text-based format, which means you need to open this file in text mode and specify a character

encoding. You can never go wrong with UTF-8.

3. Like the pickle module, the json module defines a dump() function which takes a Python data structure

and a writeable stream object. The dump() function serializes the Python data structure and writes it to the

stream object. Doing this inside a with statement will ensure that the file is closed properly when we’re

done.

So what does the resulting JSON serialization look like?

you@localhost:~/diveintopython3/examples$ cat basic.json

{"published": true, "tags": ["diveintopython", "docbook", "html"], "comments_link": null,

"id": 256, "title": "Dive into history, 2009 edition"}

That’s certainly more readable than a pickle file. But JSON can contain arbitrary whitespace between values, and the json module provides an easy way to take advantage of this to create even more readable JSON

files.

# Python Shell #1

>>> with open('basic-pretty.json', mode='w', encoding='utf-8') as f:

...     json.dump(basic_entry, f, indent=2)

1. If you pass an indent parameter to the json.dump() function, it will make the resulting JSON file more

readable, at the expense of larger file size. The indent parameter is an integer. 0 means “put each value on

its own line.” A number greater than 0 means “put each value on its own line, and use this number of

spaces to indent nested data structures.”

And this is the result:

you@localhost:~/diveintopython3/examples$ cat basic-pretty.json

{
  "published": true,
  "tags": [
    "diveintopython",
    "docbook",
    "html"
  ],
  "comments_link": null,
  "id": 256,
  "title": "Dive into history, 2009 edition"
}

13.9. MAPPING OF PYTHON DATATYPES TO JSON

Since JSON is not Python-specific, there are some mismatches in its coverage of Python datatypes. Some of

them are simply naming differences, but there are two important Python datatypes that are completely missing.

See if you can spot them:


JSON          Python 3      Notes
object        dictionary
array         list
string        string
integer       integer
real number   float
true          True          *
false         False         *
null          None          *

* All JSON values are case-sensitive.

Did you notice what was missing? Tuples & bytes! JSON has an array type, which the json module maps to

a Python list, but it does not have a separate type for “frozen arrays” (tuples). And while JSON supports

strings quite nicely, it has no support for bytes objects or byte arrays.

13.10. SERIALIZING DATATYPES UNSUPPORTED BY JSON

Even if JSON has no built-in support for bytes, that doesn’t mean you can’t serialize bytes objects. The json

module provides extensibility hooks for encoding and decoding unknown datatypes. (By “unknown,” I mean

“not defined in JSON.” Obviously the json module knows about byte arrays, but it’s constrained by the

limitations of the JSON specification.) If you want to encode bytes or other datatypes that JSON doesn’t

support natively, you need to provide custom encoders and decoders for those types.

# Python Shell #1

>>> entry

{'comments_link': None,

'internal_id': b'\xDE\xD5\xB4\xF8',

'title': 'Dive into history, 2009 edition',

'tags': ('diveintopython', 'docbook', 'html'),

'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',

'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),

'published': True}

>>> import json

>>> with open('entry.json', 'w', encoding='utf-8') as f:

...     json.dump(entry, f)
...

Traceback (most recent call last):

File "<stdin>", line 5, in <module>

File "C:\Python31\lib\json\__init__.py", line 178, in dump

for chunk in iterable:

File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode

for chunk in _iterencode_dict(o, _current_indent_level):

File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict

for chunk in chunks:

File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode

o = _default(o)

File "C:\Python31\lib\json\encoder.py", line 170, in default

raise TypeError(repr(o) + " is not JSON serializable")

TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable

1. OK, it’s time to revisit the entry data structure. This has it all: a boolean value, a None value, a string, a

tuple of strings, a bytes object, and a time structure.

2. I know I’ve said it before, but it’s worth repeating: JSON is a text-based format. Always open JSON files in

text mode with a UTF-8 character encoding.

3. Well that’s not good. What happened?


Here’s what happened: the json.dump() function tried to serialize the bytes object b'\xDE\xD5\xB4\xF8',

but it failed, because JSON has no support for bytes objects. However, if storing bytes is important to you,

you can define your own “mini-serialization format.”

def to_json(python_object):
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes',
                '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

1. To define your own “mini-serialization format” for a datatype that JSON doesn’t support natively, just define

a function that takes a Python object as a parameter. This Python object will be the actual object that the

json.dump() function is unable to serialize by itself — in this case, the bytes object b'\xDE\xD5\xB4\xF8'.

2. Your custom serialization function should check the type of the Python object that the json.dump()

function passed to it. This is not strictly necessary if your function only serializes one datatype, but it makes

it crystal clear what case your function is covering, and it makes it easier to extend if you need to add

serializations for more datatypes later.

3. In this case, I’ve chosen to convert a bytes object into a dictionary. The __class__ key will hold the

original datatype (as a string, 'bytes'), and the __value__ key will hold the actual value. Of course this

can’t be a bytes object; the entire point is to convert it into something that can be serialized in JSON! A

bytes object is just a sequence of integers; each integer is somewhere in the range 0–255. We can use the

list() function to convert the bytes object into a list of integers. So b'\xDE\xD5\xB4\xF8' becomes

[222, 213, 180, 248]. (Do the math! It works! The byte \xDE in hexadecimal is 222 in decimal, \xD5 is

213, and so on.)

4. This line is important. The data structure you’re serializing may contain types that neither the built-in JSON

serializer nor your custom serializer can handle. In this case, your custom serializer must raise a TypeError

so that the json.dump() function knows that your custom serializer did not recognize the type.

That’s it; you don’t need to do anything else. In particular, this custom serialization function returns a Python

dictionary, not a string. You’re not doing the entire serializing-to-JSON yourself; you’re only doing the

converting-to-a-supported-datatype part. The json.dump() function will do the rest.
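You can sanity-check the conversion function by itself with json.dumps(), which serializes to a string instead of a stream. (A quick interactive test; it assumes you've saved to_json() in customserializer.py, as the next example does, and sort_keys=True just makes the key order predictable.)

>>> import json
>>> from customserializer import to_json
>>> list(b'\xDE\xD5\xB4\xF8')
[222, 213, 180, 248]
>>> json.dumps(b'\xDE\xD5\xB4\xF8', default=to_json, sort_keys=True)
'{"__class__": "bytes", "__value__": [222, 213, 180, 248]}'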

# Python Shell #1

>>> import customserializer

>>> with open('entry.json', 'w', encoding='utf-8') as f:

...     json.dump(entry, f, default=customserializer.to_json)
...

Traceback (most recent call last):

File "<stdin>", line 9, in <module>

json.dump(entry, f, default=customserializer.to_json)

File "C:\Python31\lib\json\__init__.py", line 178, in dump

for chunk in iterable:

File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode

for chunk in _iterencode_dict(o, _current_indent_level):

File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict

for chunk in chunks:

File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode

o = _default(o)

File "/Users/pilgrim/diveintopython3/examples/customserializer.py", line 12, in to_json

raise TypeError(repr(python_object) + ' is not JSON serializable')

TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable

1. The customserializer module is where you just defined the to_json() function in the previous example.

2. Text mode, UTF-8 encoding, yadda yadda. (You’ll forget! I forget sometimes! And everything will work right

up until the moment that it fails, and then it will fail most spectacularly.)

3. This is the important bit: to hook your custom conversion function into the json.dump() function, pass

your function into the json.dump() function in the default parameter. (Hooray, everything in Python is an

object!)

4. OK, so it didn’t actually work. But take a look at the exception. The json.dump() function is no longer

complaining about being unable to serialize the bytes object. Now it’s complaining about a completely

different object: the time.struct_time object.

While getting a different exception might not seem like progress, it really is! It’ll just take one more tweak

to get past this.


import time

def to_json(python_object):
    if isinstance(python_object, time.struct_time):
        return {'__class__': 'time.asctime',
                '__value__': time.asctime(python_object)}
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes',
                '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

1. Adding to our existing customserializer.to_json() function, we need to check whether the Python

object (that the json.dump() function is having trouble with) is a time.struct_time.

2. If so, we’ll do something similar to the conversion we did with the bytes object: convert the

time.struct_time object to a dictionary that only contains JSON-serializable values. In this case, the easiest

way to convert a datetime into a JSON-serializable value is to convert it to a string with the

time.asctime() function. The time.asctime() function will convert that nasty-looking time.struct_time

into the string 'Fri Mar 27 22:20:42 2009'.

With these two custom conversions, the entire entry data structure should serialize to JSON without any

further problems.

# Python Shell #1

>>> with open('entry.json', 'w', encoding='utf-8') as f:

...     json.dump(entry, f, default=customserializer.to_json)
...


you@localhost:~/diveintopython3/examples$ ls -l entry.json
-rw-r--r-- 1 you you 391 Aug 3 13:34 entry.json
you@localhost:~/diveintopython3/examples$ cat entry.json

{"published_date": {"__class__": "time.asctime", "__value__": "Fri Mar 27 22:20:42 2009"},

"comments_link": null, "internal_id": {"__class__": "bytes", "__value__": [222, 213, 180, 248]},

"tags": ["diveintopython", "docbook", "html"], "title": "Dive into history, 2009 edition",

"article_link": "http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition",

"published": true}

13.11. LOADING DATA FROM A JSON FILE

Like the pickle module, the json module has a load() function which takes a stream object, reads JSON-

encoded data from it, and creates a new Python object that mirrors the JSON data structure.

# Python Shell #2

>>> del entry

>>> entry

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

NameError: name 'entry' is not defined

>>> import json

>>> with open('entry.json', 'r', encoding='utf-8') as f:

...     entry = json.load(f)
...

>>> entry

{'comments_link': None,

'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},

'title': 'Dive into history, 2009 edition',

'tags': ['diveintopython', 'docbook', 'html'],

'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',

'published_date': {'__class__': 'time.asctime', '__value__': 'Fri Mar 27 22:20:42 2009'},

'published': True}

1. For demonstration purposes, switch to Python Shell #2 and delete the entry data structure that you created

earlier in this chapter with the pickle module.

2. In the simplest case, the json.load() function works the same as the pickle.load() function. You pass in

a stream object and it returns a new Python object.

3. I have good news and bad news. Good news first: the json.load() function successfully read the

entry.json file you created in Python Shell #1 and created a new Python object that contained the data.

Now the bad news: it didn’t recreate the original entry data structure. The two values 'internal_id' and

'published_date' were recreated as dictionaries — specifically, the dictionaries with JSON-compatible

values that you created in the to_json() conversion function.

json.load() doesn’t know anything about any conversion function you may have passed to json.dump().

What you need is the opposite of the to_json() function — a function that will take a custom-converted

JSON object and convert it back to the original Python datatype.


# add this to customserializer.py

def from_json(json_object):
    if '__class__' in json_object:
        if json_object['__class__'] == 'time.asctime':
            return time.strptime(json_object['__value__'])
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__'])
    return json_object

1. This conversion function also takes one parameter and returns one value. But the parameter it takes is not a

string, it’s a Python object — the result of deserializing a JSON-encoded string into Python.

2. All you need to do is check whether this object contains the '__class__' key that the to_json() function

created. If so, the value of the '__class__' key will tell you how to decode the value back into the original

Python datatype.

3. To decode the time string returned by the time.asctime() function, you use the time.strptime()

function. This function takes a formatted datetime string (in a customizable format, but it defaults to the

same format that time.asctime() defaults to) and returns a time.struct_time.

4. To convert a list of integers back into a bytes object, you can use the bytes() function.
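You can try both decodings in isolation before wiring the function into json.load():

>>> import time
>>> t = time.strptime('Fri Mar 27 22:20:42 2009')
>>> (t.tm_year, t.tm_mon, t.tm_mday)
(2009, 3, 27)
>>> bytes([222, 213, 180, 248])
b'\xde\xd5\xb4\xf8'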

That was it; there were only two datatypes handled in the to_json() function, and now those two

datatypes are handled in the from_json() function. This is the result:

# Python Shell #2

>>> import customserializer

>>> with open('entry.json', 'r', encoding='utf-8') as f:

...     entry = json.load(f, object_hook=customserializer.from_json)
...

>>> entry

{'comments_link': None,

'internal_id': b'\xDE\xD5\xB4\xF8',

'title': 'Dive into history, 2009 edition',

'tags': ['diveintopython', 'docbook', 'html'],

'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',

'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),

'published': True}

1. To hook the from_json() function into the deserialization process, pass it as the object_hook parameter

to the json.load() function. Functions that take functions; it’s so handy!

2. The entry data structure now contains an 'internal_id' key whose value is a bytes object. It also

contains a 'published_date' key whose value is a time.struct_time object.

There is one final glitch, though.

# Python Shell #1

>>> import customserializer

>>> with open('entry.json', 'r', encoding='utf-8') as f:

...     entry2 = json.load(f, object_hook=customserializer.from_json)
...

>>> entry2 == entry

False

>>> entry['tags']

('diveintopython', 'docbook', 'html')

>>> entry2['tags']

['diveintopython', 'docbook', 'html']


1. Even after hooking the to_json() function into the serialization, and hooking the from_json() function into

the deserialization, we still haven’t recreated a perfect replica of the original data structure. Why not?

2. In the original entry data structure, the value of the 'tags' key was a tuple of three strings.

3. But in the round-tripped entry2 data structure, the value of the 'tags' key is a list of three strings. JSON

doesn’t distinguish between tuples and lists; it only has a single list-like datatype, the array, and the json

module silently converts both tuples and lists into JSON arrays during serialization. For most uses, you can

ignore the difference between tuples and lists, but it’s something to keep in mind as you work with the json

module.
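You can see this one-way conversion in isolation with json.dumps() and json.loads(): a tuple goes in, a JSON array comes out, and a list comes back.

>>> import json
>>> json.dumps(('diveintopython', 'docbook', 'html'))
'["diveintopython", "docbook", "html"]'
>>> json.loads('["diveintopython", "docbook", "html"]')
['diveintopython', 'docbook', 'html']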

13.12. FURTHER READING

☞ Many articles about the pickle module make references to cPickle. In Python 2,

there were two implementations of the pickle module, one written in pure Python

and another written in C (but still callable from Python). In Python 3, these two

modules have been consolidated, so you should always just import pickle. You may

find these articles useful, but you should ignore the now-obsolete information about

cPickle.

On pickling with the pickle module:

pickle module

pickle and cPickle — Python object serialization

Using pickle

Python persistence management

On JSON and the json module:

json — JavaScript Object Notation Serializer

JSON encoding and decoding with custom objects in Python

On pickle extensibility:


Pickling class instances

Persistence of external objects

Handling stateful objects


CHAPTER 14. HTTP WEB SERVICES

A ruffled mind makes a restless pillow.

— Charlotte Brontë

14.1. DIVING IN

Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers

using nothing but the operations of HTTP. If you want to get data from the server, use HTTP GET. If you

want to send new data to the server, use HTTP POST. Some more advanced HTTP web service APIs also

allow creating, modifying, and deleting data, using HTTP PUT and HTTP DELETE. That’s it. No registries, no

envelopes, no wrappers, no tunneling. The “verbs” built into the HTTP protocol (GET, POST, PUT, and

DELETE) map directly to application-level operations for retrieving, creating, modifying, and deleting data.

The main advantage of this approach is simplicity, and its simplicity has proven popular. Data — usually XML

or JSON — can be built and stored statically, or generated dynamically by a server-side script, and all major programming languages (including Python, of course!) include an HTTP library for downloading it. Debugging

is also easier; because each resource in an HTTP web service has a unique address (in the form of a URL),

you can load it in your web browser and immediately see the raw data.

Examples of HTTP web services:

Google Data APIs allow you to interact with a wide variety of Google services, including Blogger and

YouTube.

Flickr Services allow you to upload and download photos from Flickr.

Twitter API allows you to publish status updates on Twitter.

…and many more

Python 3 comes with two different libraries for interacting with HTTP web services:


http.client is a low-level library that implements RFC 2616, the HTTP protocol.

urllib.request is an abstraction layer built on top of http.client. It provides a standard API for accessing both HTTP and FTP servers, automatically follows HTTP redirects, and handles some common

forms of HTTP authentication.

So which one should you use? Neither of them. Instead, you should use httplib2, an open source third-party library that implements HTTP more fully than http.client but provides a better abstraction than

urllib.request.

To understand why httplib2 is the right choice, you first need to understand HTTP.

14.2. FEATURES OF HTTP

There are five important features which all HTTP clients should support.

14.2.1. CACHING

The most important thing to understand about any type of web service is that network access is incredibly

expensive. I don’t mean “dollars and cents” expensive (although bandwidth ain’t free). I mean that it takes an

extraordinary long time to open a connection, send a request, and retrieve a response from a remote

server. Even on the fastest broadband connection, latency (the time it takes to send a request and start

retrieving data in a response) can still be higher than you anticipated. A router misbehaves, a packet is

dropped, an intermediate proxy is under attack — there’s never a dull moment on the public internet, and there may be nothing you can do about it.

HTTP is designed with caching in mind. There is an entire class of devices (called "caching proxies") whose only job is to sit between you and the rest of the world and minimize network access. Your company or ISP almost certainly maintains caching proxies, even if you're unaware of them. They work because caching is built into the HTTP protocol.

Cache-Control: max-age means "don't bug me until next week."

Here's a concrete example of how caching works. You visit diveintomark.org in your browser. That page includes a background image, wearehugh.com/m.jpg. When your browser downloads that image, the server includes the following HTTP headers:

HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg

The Cache-Control and Expires headers tell your browser (and any caching proxies between you and the

server) that this image can be cached for up to a year. A year! And if, in the next year, you visit another

page which also includes a link to this image, your browser will load the image from its cache without

generating any network activity whatsoever.

But wait, it gets better. Let’s say your browser purges the image from your local cache for some reason.

Maybe it ran out of disk space; maybe you manually cleared the cache. Whatever. But the HTTP headers said

that this data could be cached by public caching proxies. (Technically, the important thing is what the

headers don’t say; the Cache-Control header doesn’t have the private keyword, so this data is cacheable


by default.) Caching proxies are designed to have tons of storage space, probably far more than your local

browser has allocated.

If your company or ISP maintains a caching proxy, the proxy may still have the image cached. When you visit

diveintomark.org again, your browser will look in its local cache for the image, but it won’t find it, so it

will make a network request to try to download it from the remote server. But if the caching proxy still has

a copy of the image, it will intercept that request and serve the image from its cache. That means that your

request will never reach the remote server; in fact, it will never leave your company’s network. That makes

for a faster download (fewer network hops) and saves your company money (less data being downloaded

from the outside world).

HTTP caching only works when everybody does their part. On one side, servers need to send the correct

headers in their response. On the other side, clients need to understand and respect those headers before

they request the same data twice. The proxies in the middle are not a panacea; they can only be as smart as

the servers and clients allow them to be.

Python’s HTTP libraries do not support caching, but httplib2 does.

14.2.2. LAST-MODIFIED CHECKING

Some data never changes, while other data changes all the time. In between, there is a vast field of data that

might have changed, but hasn’t. CNN.com’s feed is updated every few minutes, but my weblog’s feed may

not change for days or weeks at a time. In the latter case, I don’t want to tell clients to cache my feed for

weeks at a time, because then when I do actually post something, people may not read it for weeks (because

they’re respecting my cache headers which said “don’t bother checking this feed for weeks”). On the other

hand, I don’t want clients downloading my entire feed once an hour if it hasn’t changed!

HTTP has a solution to this, too. When you request data for the first time, the server can send back a Last-Modified header. This is exactly what it sounds like: the date that the data was changed. That background image referenced from diveintomark.org included a Last-Modified header.

304: Not Modified means "same shit, different day."

HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg

When you request the same data a second (or third or fourth) time, you can send an If-Modified-Since

header with your request, with the date you got back from the server last time. If the data has changed

since then, then the server ignores the If-Modified-Since header and just gives you the new data with a

200 status code. But if the data hasn't changed since then, the server sends back a special HTTP 304 status

code, which means “this data hasn’t changed since the last time you asked for it.” You can test this on the

command line, using curl:

you@localhost:~$ curl -I -H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT" http://wearehugh.com/m.jpg

HTTP/1.1 304 Not Modified

Date: Sun, 31 May 2009 18:04:39 GMT

Server: Apache

Connection: close

ETag: "3075-ddc8d800"

Expires: Mon, 31 May 2010 18:04:39 GMT

Cache-Control: max-age=31536000, public


Why is this an improvement? Because when the server sends a 304, it doesn’t re-send the data. All you get is

the status code. Even after your cached copy has expired, last-modified checking ensures that you won’t

download the same data twice if it hasn’t changed. (As an extra bonus, this 304 response also includes

caching headers. Proxies will keep a copy of data even after it officially “expires,” in the hopes that the data

hasn’t really changed and the next request responds with a 304 status code and updated cache information.)

Python’s HTTP libraries do not support last-modified date checking, but httplib2 does.

14.2.3. ETAG CHECKING

ETags are an alternate way to accomplish the same thing as last-modified checking. With ETags, the server sends a hash code in an ETag header along with the data you requested. (Exactly how this hash is

determined is entirely up to the server. The only requirement is that it changes when the data changes.)

That background image referenced from diveintomark.org had an ETag header.

HTTP/1.1 200 OK

Date: Sun, 31 May 2009 17:14:04 GMT

Server: Apache

Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT

ETag: "3075-ddc8d800"

Accept-Ranges: bytes

Content-Length: 12405

Cache-Control: max-age=31536000, public

Expires: Mon, 31 May 2010 17:14:04 GMT

Connection: close

Content-Type: image/jpeg

The second time you request the same data, you include the ETag hash in an If-None-Match header of your request. If the data hasn't changed, the server will send you back a 304 status code. As with the last-modified date checking, the server sends back only the 304 status code; it doesn't send you the same data a second time. By including the ETag hash in your second request, you're telling the server that there's no need to re-send the same data if it still matches this hash, since you still have the data from the last time.

ETag means "there's nothing new under the sun."

Again with the curl:

you@localhost:~$ curl -I -H "If-None-Match: \"3075-ddc8d800\"" http://wearehugh.com/m.jpg

HTTP/1.1 304 Not Modified

Date: Sun, 31 May 2009 18:04:39 GMT

Server: Apache

Connection: close

ETag: "3075-ddc8d800"

Expires: Mon, 31 May 2010 18:04:39 GMT

Cache-Control: max-age=31536000, public

1. ETags are commonly enclosed in quotation marks, but the quotation marks are part of the value. That means

you need to send the quotation marks back to the server in the If-None-Match header.

Python’s HTTP libraries do not support ETags, but httplib2 does.


14.2.4. COMPRESSION

When you talk about HTTP web services, you’re almost always talking about moving text-based data back

and forth over the wire. Maybe it’s XML, maybe it’s JSON, maybe it’s just plain text. Regardless of the format, text compresses well. The example feed in the XML chapter is 3070 bytes uncompressed, but would be 941 bytes after gzip compression. That’s just 30% of the original size!

HTTP supports several compression algorithms. The two most common types are gzip and deflate. When you request a resource over HTTP, you can ask the server to send it in compressed format. You include an

Accept-encoding header in your request that lists which compression algorithms you support. If the server

supports any of the same algorithms, it will send you back compressed data (with a Content-encoding

header that tells you which algorithm it used). Then it’s up to you to decompress the data.
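If you wanted to opt in by hand with the standard library, it would look something like this sketch; httplib2, introduced later in this chapter, sends the Accept-Encoding header and decompresses the response for you automatically.

import gzip, io
import urllib.request

request = urllib.request.Request(
    'http://diveintopython3.org/examples/feed.xml',
    headers={'Accept-Encoding': 'gzip'})
response = urllib.request.urlopen(request)
raw = response.read()
# only decompress if the server actually chose gzip
if response.headers.get('Content-Encoding') == 'gzip':
    raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()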

☞ Important tip for server-side developers: make sure that the compressed version of a

resource has a different Etag than the uncompressed version. Otherwise, caching

proxies will get confused and may serve the compressed version to clients that can’t

handle it. Read the discussion of Apache bug 39727 for more details on this subtle issue.

Python’s HTTP libraries do not support compression, but httplib2 does.

14.2.5. REDIRECTS

Cool URIs don’t change, but many URIs are seriously uncool. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might

be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization

expands and reorganizes; http://www.example.com/index.xml becomes http://server-farm-1.example.com/index.xml.

Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means "everything's normal, here's the page you asked for". Status code 404 means "page not found". (You've probably seen 404 errors while browsing the web.) Status codes in the 300's indicate some form of redirection.

Location means "look over there!"

HTTP has several different ways of signifying that a resource has moved. The two most common techniques are status codes 302 and 301. Status code 302 is a temporary redirect; it means "oops, that got moved over here temporarily" (and then gives the temporary address in a Location header). Status code 301 is a permanent redirect; it means "oops, that got moved permanently" (and then gives the new address in a Location header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you're supposed to use the new address from then on.

The urllib.request module automatically follows redirects when it receives the appropriate status code

from the HTTP server, but it doesn’t tell you that it did so. You’ll end up getting data you asked for, but

you’ll never know that the underlying library “helpfully” followed a redirect for you. So you’ll continue

pounding away at the old address, and each time you’ll get redirected to the new address, and each time the

urllib.request module will “helpfully” follow the redirect. In other words, it treats permanent redirects

the same as temporary redirects. That means two round trips instead of one, which is bad for the server

and bad for you.

httplib2 handles permanent redirects for you. Not only will it tell you that a permanent redirect occurred,

it will keep track of them locally and automatically rewrite redirected URLs before requesting them.
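Jumping ahead a little: once you have httplib2 installed, as described later in this chapter, a followed redirect shows up on the response object. (The URL here is hypothetical.)

>>> import httplib2
>>> h = httplib2.Http('.cache')
>>> response, content = h.request('http://example.com/old-location')
>>> response.previous   # the redirect response that was followed, or None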


14.3. HOW NOT TO FETCH DATA OVER HTTP

Let’s say you want to download a resource over HTTP, such as an Atom feed. Being a feed, you’re not just going to download it once; you’re going to download it over and over again. (Most feed readers will check

for changes once an hour.) Let’s do it the quick-and-dirty way first, and then see how you can do better.

>>> import urllib.request

>>> a_url = 'http://diveintopython3.org/examples/feed.xml'

>>> data = urllib.request.urlopen(a_url).read()

>>> type(data)

<class 'bytes'>

>>> print(data)

<?xml version='1.0' encoding='utf-8'?>

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>

<title>dive into mark</title>

<subtitle>currently between addictions</subtitle>

<id>tag:diveintomark.org,2001-07-29:/</id>

<updated>2009-03-27T21:56:07Z</updated>

<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>

1. Downloading anything over HTTP is incredibly easy in Python; in fact, it’s a one-liner. The urllib.request

module has a handy urlopen() function that takes the address of the page you want, and returns a file-like

object that you can just read() from to get the full contents of the page. It just can’t get any easier.

2. The urlopen().read() method always returns a bytes object, not a string. Remember, bytes are bytes; characters are an abstraction. HTTP servers don’t deal in abstractions. If you request a resource, you get

bytes. If you want it as a string, you’ll need to determine the character encoding and explicitly convert it to a string.
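Since this particular feed declares its encoding in the XML declaration, converting is one more line. (This is safe here only because we happen to know the feed is utf-8; in general you'd need to determine the encoding first.)

>>> text = data.decode('utf-8')
>>> text[:35]
"<?xml version='1.0' encoding='utf-8"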

So what’s wrong with this? For a quick one-off during testing or development, there’s nothing wrong with it.

I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same

technique works for any web page. But once you start thinking in terms of a web service that you want to

access on a regular basis ( e.g. requesting this feed once an hour), then you’re being inefficient, and you’re

being rude.


14.4. WHAT’S ON THE WIRE?

To see why this is inefficient and rude, let’s turn on the debugging features of Python’s HTTP library and see

what’s being sent “on the wire” ( i.e. over the network).

>>> from http.client import HTTPConnection

>>> HTTPConnection.debuglevel = 1

>>> from urllib.request import urlopen

>>> response = urlopen('http://diveintopython3.org/examples/feed.xml')

send: b'GET /examples/feed.xml HTTP/1.1

Host: diveintopython3.org

Accept-Encoding: identity

User-Agent: Python-urllib/3.1

Connection: close'

reply: 'HTTP/1.1 200 OK'

…further debugging information omitted…

1. As I mentioned at the beginning of the chapter, urllib.request relies on another standard Python library,

http.client. Normally you don’t need to touch http.client directly. (The urllib.request module

imports it automatically.) But we import it here so we can toggle the debugging flag on the HTTPConnection

class that urllib.request uses to connect to the HTTP server.

2. Now that the debugging flag is set, information on the HTTP request and response is printed out in real

time. As you can see, when you request the Atom feed, the urllib.request module sends five lines to the

server.

3. The first line specifies the HTTP verb you’re using, and the path of the resource (minus the domain name).

4. The second line specifies the domain name from which we’re requesting this feed.

5. The third line specifies the compression algorithms that the client supports. As I mentioned earlier,

urllib.request does not support compression by default.

6. The fourth line specifies the name of the library that is making the request. By default, this is Python-urllib

plus a version number. Both urllib.request and httplib2 support changing the user agent, simply by

adding a User-Agent header to the request (which will override the default value).


Now let’s look at what the server sent back in its

response.

We’re

downloading

3070 bytes

when we

could have

just

downloaded

941.

350

# continued from previous example

>>> print(response.headers.as_string())

Date: Sun, 31 May 2009 19:23:06 GMT

Server: Apache

Last-Modified: Sun, 31 May 2009 06:39:55 GMT

ETag: "bfe-93d9c4c0"

Accept-Ranges: bytes

Content-Length: 3070

Cache-Control: max-age=86400

Expires: Mon, 01 Jun 2009 19:23:06 GMT

Vary: Accept-Encoding

Connection: close

Content-Type: application/xml

>>> data = response.read()

>>> len(data)

3070

1. The response returned from the urllib.request.urlopen() function contains all the HTTP headers the

server sent back. It also contains methods to download the actual data; we’ll get to that in a minute.

2. The server tells you when it handled your request.

3. This response includes a Last-Modified header.

4. This response includes an ETag header.

5. The data is 3070 bytes long. Notice what isn’t here: a Content-encoding header. Your request stated that

you only accept uncompressed data (Accept-encoding: identity), and sure enough, this response contains

uncompressed data.

6. This response includes caching headers that state that this feed can be cached for up to 24 hours (86400

seconds).

7. And finally, download the actual data by calling response.read(). As you can tell from the len() function,

this downloads all 3070 bytes at once.

As you can see, this code is already inefficient: it asked for (and received) uncompressed data. I know for a

fact that this server supports gzip compression, but HTTP compression is opt-in. We didn’t ask for it, so we didn’t get it. That means we’re downloading 3070 bytes when we could have just downloaded 941. Bad dog,

no biscuit.


But wait, it gets worse! To see just how inefficient this code is, let’s request the same feed a second time.

# continued from the previous example

>>> response2 = urlopen('http://diveintopython3.org/examples/feed.xml')

send: b'GET /examples/feed.xml HTTP/1.1

Host: diveintopython3.org

Accept-Encoding: identity

User-Agent: Python-urllib/3.1

Connection: close'

reply: 'HTTP/1.1 200 OK'

…further debugging information omitted…

Notice anything peculiar about this request? It hasn’t changed! It’s exactly the same as the first request. No

sign of If-Modified-Since headers. No sign of If-None-Match headers. No respect for the caching headers. Still no compression.

And what happens when you do the same thing twice? You get the same response. Twice.


# continued from the previous example

>>> print(response2.headers.as_string())

Date: Mon, 01 Jun 2009 03:58:00 GMT

Server: Apache

Last-Modified: Sun, 31 May 2009 22:51:11 GMT

ETag: "bfe-255ef5c0"

Accept-Ranges: bytes

Content-Length: 3070

Cache-Control: max-age=86400

Expires: Tue, 02 Jun 2009 03:58:00 GMT

Vary: Accept-Encoding

Connection: close

Content-Type: application/xml

>>> data2 = response2.read()

>>> len(data2)

3070

>>> data2 == data

True

1. The server is still sending the same array of “smart” headers: Cache-Control and Expires to allow caching,

Last-Modified and ETag to enable “not-modified” tracking. Even the Vary: Accept-Encoding header hints

that the server would support compression, if only you would ask for it. But you didn’t.

2. Once again, fetching this data downloads the whole 3070 bytes…

3. …the exact same 3070 bytes you downloaded last time.

HTTP is designed to work better than this. urllib speaks HTTP like I speak Spanish — enough to get by in

a jam, but not enough to hold a conversation. HTTP is a conversation. It’s time to upgrade to a library that

speaks HTTP fluently.


14.5. INTRODUCING httplib2

Before you can use httplib2, you’ll need to install it. Visit code.google.com/p/httplib2/ and download the latest version. httplib2 is available for Python 2.x and Python 3.x; make sure you get the Python 3

version, named something like httplib2-python3-0.5.0.zip.

Unzip the archive, open a terminal window, and go to the newly created httplib2 directory. On Windows,

open the Start menu, select Run..., type cmd.exe and press ENTER.


c:\Users\pilgrim\Downloads> dir

Volume in drive C has no label.

Volume Serial Number is DED5-B4F8

Directory of c:\Users\pilgrim\Downloads

07/28/2009 12:36 PM <DIR> .

07/28/2009 12:36 PM <DIR> ..

07/28/2009 12:36 PM <DIR> httplib2-python3-0.5.0

07/28/2009 12:33 PM 18,997 httplib2-python3-0.5.0.zip

1 File(s) 18,997 bytes

3 Dir(s) 61,496,684,544 bytes free

c:\Users\pilgrim\Downloads> cd httplib2-python3-0.5.0

c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> c:\python31\python.exe setup.py install

running install

running build

running build_py

running install_lib

creating c:\python31\Lib\site-packages\httplib2

copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2

copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2

byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc

byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc

running install_egg_info

Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info

On Mac OS X, run the Terminal.app application in your /Applications/Utilities/ folder. On Linux, run

the Terminal application, which is usually in your Applications menu under Accessories or System.


you@localhost:~/Desktop$ unzip httplib2-python3-0.5.0.zip

Archive: httplib2-python3-0.5.0.zip

inflating: httplib2-python3-0.5.0/README

inflating: httplib2-python3-0.5.0/setup.py

inflating: httplib2-python3-0.5.0/PKG-INFO

inflating: httplib2-python3-0.5.0/httplib2/__init__.py

inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py

you@localhost:~/Desktop$ cd httplib2-python3-0.5.0/

you@localhost:~/Desktop/httplib2-python3-0.5.0$ sudo python3 setup.py install

running install

running build

running build_py

creating build

creating build/lib.linux-x86_64-3.1

creating build/lib.linux-x86_64-3.1/httplib2

copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2

copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2

running install_lib

creating /usr/local/lib/python3.1/dist-packages/httplib2

copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2

copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2

byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc

byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc

running install_egg_info

Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info

To use httplib2, create an instance of the httplib2.Http class.


>>> import httplib2

>>> h = httplib2.Http('.cache')

>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')

>>> response.status

200

>>> content[:52]

b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="

>>> len(content)

3070

1. The primary interface to httplib2 is the Http object. For reasons you’ll see in the next section, you should

always pass a directory name when you create an Http object. The directory does not need to exist;

httplib2 will create it if necessary.

2. Once you have an Http object, retrieving data is as simple as calling the request() method with the

address of the data you want. This will issue an HTTP GET request for that URL. (Later in this chapter, you’ll

see how to issue other HTTP requests, like POST.)

3. The request() method returns two values. The first is an httplib2.Response object, which contains all

the HTTP headers the server returned. For example, a status code of 200 indicates that the request was

successful.

4. The content variable contains the actual data that was returned by the HTTP server. The data is returned

as a bytes object, not a string. If you want it as a string, you’ll need to determine the character encoding

and convert it yourself.

☞ You probably only need one httplib2.Http object. There are valid reasons for

creating more than one, but you should only do so if you know why you need them.

“I need to request data from two different URLs” is not a valid reason. Re-use the

Http object and just call the request() method twice.
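For example, fetching two resources through one Http object looks like this (the second URL is hypothetical):

>>> h = httplib2.Http('.cache')
>>> response1, content1 = h.request('http://diveintopython3.org/examples/feed.xml')
>>> response2, content2 = h.request('http://diveintopython3.org/examples/other.xml')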



14.5.1. A SHORT DIGRESSION TO EXPLAIN WHY httplib2 RETURNS BYTES INSTEAD OF STRINGS

Bytes. Strings. What a pain. Why can’t httplib2 “just” do the conversion for you? Well, it’s complicated,

because the rules for determining the character encoding are specific to what kind of resource you’re

requesting. How could httplib2 know what kind of resource you’re requesting? It’s usually listed in the

Content-Type HTTP header, but that's an optional feature of HTTP and not all HTTP servers include it. If

that header is not included in the HTTP response, it’s left up to the client to guess. (This is commonly called

“content sniffing,” and it’s never perfect.)

If you know what sort of resource you’re expecting (an XML document in this case), perhaps you could

“just” pass the returned bytes object to the xml.etree.ElementTree.parse() function. That’ll work as long as the XML document includes information on its own character encoding (as this one does), but that’s

an optional feature and not all XML documents do that. If an XML document doesn’t include encoding

information, the client is supposed to look at the enclosing transport — i.e. the Content-Type HTTP header,

which can include a charset parameter.

But it’s worse than that. Now character encoding information can be in two

places: within the XML document itself, and within the Content-Type HTTP

header. If the information is in both places, which one wins? According to RFC

3023 (I swear I am not making this up), if the media type given in the ContentType H T T P header is application/xml, application/xml-dtd, application/

xml-external-parsed-entity, or any one of the subtypes of application/xml

such as application/atom+xml or application/rss+xml or even application/

rdf+xml, then the encoding is

1. the encoding given in the charset parameter of the Content-Type HTTP header, or

2. the encoding given in the encoding attribute of the XML declaration within the document, or

3. UTF-8

On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-

external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the

XML declaration within the document is ignored completely, and the encoding is


1. the encoding given in the charset parameter of the Content-Type HTTP header, or

2. us-ascii
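Boiled down to code, those rules look something like this sketch. The function name and parameters are mine, not a real API; it assumes the media type and the charset parameter have already been parsed out of the Content-Type header, and the encoding attribute out of the XML declaration:

def xml_encoding(media_type, header_charset=None, xml_decl_encoding=None):
    if media_type.startswith('text/'):
        # text/xml & friends: the XML declaration is ignored completely
        return header_charset or 'us-ascii'
    # application/xml and its +xml subtypes
    return header_charset or xml_decl_encoding or 'utf-8'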

And that’s just for XML documents. For HTML documents, web browsers have constructed such byzantine

rules for content-sniffing [PDF] that we’re still trying to figure them all out.

"Patches welcome."

14.5.2. HOW httplib2 HANDLES CACHING

Remember in the previous section when I said you should always create an httplib2.Http object with a

directory name? Caching is the reason.

# continued from the previous example

>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml')

>>> response2.status

200

>>> content2[:52]

b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="

>>> len(content2)

3070

1. This shouldn’t be terribly surprising. It’s the same thing you did last time, except you’re putting the result

into two new variables.

2. The HTTP status is once again 200, just like last time.

3. The downloaded content is the same as last time, too.

So… who cares? Quit your Python interactive shell and relaunch it with a new session, and I’ll show you.


# NOT continued from previous example!

# Please exit out of the interactive shell

# and launch a new one.

>>> import httplib2

>>> httplib2.debuglevel = 1

>>> h = httplib2.Http('.cache')

>>> response, content = h.request('http://diveintopython3.org/examples/feed.xml')

>>> len(content)

3070

>>> response.status

200

>>> response.fromcache

True

1. Let’s turn on debugging and see what’s on the wire. This is the httplib2 equivalent of turning on debugging in http.client. httplib2 will print all the data being sent to the server and some key information being

sent back.

2. Create an httplib2.Http object with the same directory name as before.

3. Request the same URL as before. Nothing appears to happen. More precisely, nothing gets sent to the server,

and nothing gets returned from the server. There is absolutely no network activity whatsoever.

4. Yet we did “receive” some data — in fact, we received all of it.

5. We also “received” an HTTP status code indicating that the “request” was successful.

6. Here’s the rub: this “response” was generated from httplib2’s local cache. That directory name you passed

in when you created the httplib2.Http object — that directory holds httplib2’s cache of all the

operations it’s ever performed.

☞ If you want to turn on httplib2 debugging, you need to set a module-level constant (httplib2.debuglevel), then create a new httplib2.Http object. If you want to turn off debugging, you need to change the same module-level constant, then create a new httplib2.Http object.

What's on the wire? Absolutely nothing.

You previously requested the data at this URL. That request was successful (status: 200). That response included not only the feed data, but also a set of caching headers that told anyone who was listening that they could cache this resource for up to 24 hours (Cache-Control: max-age=86400, which is 24 hours measured in seconds). httplib2 understands and respects those caching headers, and it stored the previous response in the .cache directory (which you passed in when you created the Http object). That cache hasn't expired yet, so the second time you request the data at this URL, httplib2 simply returns the cached result without ever hitting the network.

I say "simply," but obviously there is a lot of complexity hidden behind that simplicity. httplib2 handles HTTP caching automatically and by default. If for some reason you need to know whether a response came from the cache, you can check response.fromcache. Otherwise, it Just Works.

Now, suppose you have data cached, but you want to bypass the cache and re-request it from the remote

server. Browsers sometimes do this if the user specifically requests it. For example, pressing F5 refreshes the

current page, but pressing Ctrl+F5 bypasses the cache and re-requests the current page from the remote

server. You might think “oh, I’ll just delete the data from my local cache, then request it again.” You could

do that, but remember that there may be more parties involved than just you and the remote server. What

about those intermediate proxy servers? They’re completely beyond your control, and they may still have

that data cached, and will happily return it to you because (as far as they are concerned) their cache is still

valid.


Instead of manipulating your local cache and hoping for the best, you should use the features of HTTP to

ensure that your request actually reaches the remote server.

# continued from the previous example

>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed.xml',

...     headers={'cache-control':'no-cache'})

connect: (diveintopython3.org, 80)

send: b'GET /examples/feed.xml HTTP/1.1

Host: diveintopython3.org

user-agent: Python-httplib2/$Rev: 259 $

accept-encoding: deflate, gzip

cache-control: no-cache'

reply: 'HTTP/1.1 200 OK'

…further debugging information omitted…

>>> response2.status

200

>>> response2.fromcache

False

>>> print(dict(response2.items()))

{'status': '200',

'content-length': '3070',

'content-location': 'http://diveintopython3.org/examples/feed.xml',

'accept-ranges': 'bytes',

'expires': 'Wed, 03 Jun 2009 00:40:26 GMT',

'vary': 'Accept-Encoding',

'server': 'Apache',

'last-modified': 'Sun, 31 May 2009 22:51:11 GMT',

'connection': 'close',

'-content-encoding': 'gzip',

'etag': '"bfe-255ef5c0"',

'cache-control': 'max-age=86400',

'date': 'Tue, 02 Jun 2009 00:40:26 GMT',

'content-type': 'application/xml'}


1. httplib2 allows you to add arbitrary HTTP headers to any outgoing request. In order to bypass all caches

(not just your local disk cache, but also any caching proxies between you and the remote server), add a no-cache header in the headers dictionary.

2. Now you see httplib2 initiating a network request. httplib2 understands and respects caching headers in

both directions — as part of the incoming response and as part of the outgoing request. It noticed that you

added the no-cache header, so it bypassed its local cache altogether and then had no choice but to hit the

network to request the data.

3. This response was not generated from your local cache. You knew that, of course, because you saw the

debugging information on the outgoing request. But it’s nice to have that programmatically verified.

4. The request succeeded; you downloaded the entire feed again from the remote server. Of course, the

server also sent back a full complement of HTTP headers along with the feed data. That includes caching

headers, which httplib2 uses to update its local cache, in the hopes of avoiding network access the next

time you request this feed. Everything about HTTP caching is designed to maximize cache hits and minimize

network access. Even though you bypassed the cache this time, the remote server would really appreciate it

if you would cache the result for next time.
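If you find yourself bypassing caches often, you can wrap the pattern in a small helper function. This is just a sketch; fetch_fresh is my own name, not part of httplib2.

def fetch_fresh(h, url):
    '''Request url through h, telling every cache along the way
    (local and intermediate) to pass the request to the origin server.'''
    return h.request(url, headers={'cache-control': 'no-cache'})

# usage:
# response, content = fetch_fresh(h, 'http://diveintopython3.org/examples/feed.xml')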

14.5.3. HOW httplib2 HANDLES Last-Modified AND ETag HEADERS

The Cache-Control and Expires caching headers are called freshness indicators. They tell caches in no uncertain terms that you can completely avoid all network access until the cache expires. And that’s exactly

the behavior you saw in the previous section: given a freshness indicator, httplib2 does not generate a single byte of network activity to serve up cached data (unless you explicitly bypass the cache, of course).

But what about the case where the data might have changed, but hasn’t? HTTP defines Last-Modified and

ETag headers for this purpose. These headers are called validators. If the local cache is no longer fresh, a client can send the validators with the next request to see if the data has actually changed. If the data hasn’t

changed, the server sends back a 304 status code and no data. So there’s still a round-trip over the network,

but you end up downloading fewer bytes.


>>> import httplib2

>>> httplib2.debuglevel = 1

>>> h = httplib2.Http('.cache')

>>> response, content = h.request('http://diveintopython3.org/')

connect: (diveintopython3.org, 80)

send: b'GET / HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 200 OK'

>>> print(dict(response.items()))

{'-content-encoding': 'gzip',

'accept-ranges': 'bytes',

'connection': 'close',

'content-length': '6657',

'content-location': 'http://diveintopython3.org/',

'content-type': 'text/html',

'date': 'Tue, 02 Jun 2009 03:26:54 GMT',

'etag': '"7f806d-1a01-9fb97900"', 'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',

'server': 'Apache',

'status': '200',

'vary': 'Accept-Encoding,User-Agent'}

>>> len(content)

6657

1. Instead of the feed, this time we’re going to download the site’s home page, which is HTML. Since this is the

first time you’ve ever requested this page, httplib2 has little to work with, and it sends out a minimum of

headers with the request.

2. The response contains a multitude of HTTP headers… but no caching information. However, it does include

both an ETag and Last-Modified header.

3. At the time I constructed this example, this page was 6657 bytes. It’s probably changed since then, but don’t

worry about it.


# continued from the previous example

>>> response, content = h.request('http://diveintopython3.org/')

connect: (diveintopython3.org, 80)

send: b'GET / HTTP/1.1

Host: diveintopython3.org

if-none-match: "7f806d-1a01-9fb97900"

if-modified-since: Tue, 02 Jun 2009 02:51:48 GMT

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 304 Not Modified'

>>> response.fromcache

True

>>> response.status

200

>>> response.dict['status']

'304'

>>> len(content)

6657

1. You request the same page again, with the same Http object (and the same local cache).

2. httplib2 sends the ETag validator back to the server in the If-None-Match header.

3. httplib2 also sends the Last-Modified validator back to the server in the If-Modified-Since header.

4. The server looked at these validators, looked at the page you requested, and determined that the page has

not changed since you last requested it, so it sends back a 304 status code and no data.

5. Back on the client, httplib2 notices the 304 status code and loads the content of the page from its cache.

6. This might be a bit confusing. There are really two status codes — 304 (returned from the server this time,

which caused httplib2 to look in its cache), and 200 (returned from the server last time, and stored in

httplib2’s cache along with the page data). response.status returns the status from the cache.

7. If you want the raw status code returned from the server, you can get that by looking in response.dict,

which is a dictionary of the actual headers returned from the server.

8. However, you still get the data in the content variable. Generally, you don’t need to know why a response

was served from the cache. (You may not even care that it was served from the cache at all, and that’s fine

too. httplib2 is smart enough to let you act dumb.) By the time the request() method returns to the

caller, httplib2 has already updated its cache and returned the data to you.
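To make the distinction between the two status codes concrete, here is a sketch of a function that reports both; describe_response is my own name, and it assumes (as in the transcript above) that the raw server headers include a status entry.

def describe_response(response):
    '''Report where an httplib2 response came from and what the
    server actually said on the wire.'''
    origin = 'cache' if response.fromcache else 'network'
    raw_status = response.dict['status']    # headers from the server's last reply
    print('served from {0}; raw server status: {1}'.format(origin, raw_status))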


14.5.4. HOW httplib2 HANDLES COMPRESSION

HTTP supports several types of compression; the two

most common types are gzip and deflate. httplib2

supports both of these.

“We have both kinds of music, country AND western.”


>>> response, content = h.request('http://diveintopython3.org/')

connect: (diveintopython3.org, 80)

send: b'GET / HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 200 OK'

>>> print(dict(response.items()))

{'-content-encoding': 'gzip',

'accept-ranges': 'bytes',

'connection': 'close',

'content-length': '6657',

'content-location': 'http://diveintopython3.org/',

'content-type': 'text/html',

'date': 'Tue, 02 Jun 2009 03:26:54 GMT',

'etag': '"7f806d-1a01-9fb97900"',

'last-modified': 'Tue, 02 Jun 2009 02:51:48 GMT',

'server': 'Apache',

'status': '304',

'vary': 'Accept-Encoding,User-Agent'}

1. Every time httplib2 sends a request, it includes an Accept-Encoding header to tell the server that it can

handle either deflate or gzip compression.

2. In this case, the server has responded with a gzip-compressed payload. By the time the request() method

returns, httplib2 has already decompressed the body of the response and placed it in the content variable.

If you’re curious about whether or not the response was compressed, you can check response['-content-encoding']; otherwise, don’t worry about it.
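Since httplib2 response objects act like dictionaries (that is why the response['-content-encoding'] lookup works), you can use get() to probe for the key without risking a KeyError. A minimal sketch:

encoding = response.get('-content-encoding')
if encoding:
    print('payload was {0}-compressed on the wire'.format(encoding))
else:
    print('payload was not compressed')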

14.5.5. HOW httplib2 HANDLES REDIRECTS

HTTP defines two kinds of redirects: temporary and permanent. There’s nothing special to do with temporary redirects except follow them, which httplib2 does automatically.


>>> import httplib2

>>> httplib2.debuglevel = 1

>>> h = httplib2.Http('.cache')

>>> response, content = h.request('http://diveintopython3.org/examples/feed-302.xml')

connect: (diveintopython3.org, 80)

send: b'GET /examples/feed-302.xml HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 302 Found'

send: b'GET /examples/feed.xml HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 200 OK'

1. There is no feed at this URL. I’ve set up my server to issue a temporary redirect to the correct address.

2. There’s the request.

3. And there’s the response: 302 Found. Not shown here, this response also includes a Location header that

points to the real URL.

4. httplib2 immediately turns around and “follows” the redirect by issuing another request for the URL given

in the Location header: http://diveintopython3.org/examples/feed.xml

“Following” a redirect is nothing more than this example shows. httplib2 sends a request for the URL you

asked for. The server comes back with a response that says “No no, look over there instead.” httplib2

sends another request for the new URL.


# continued from the previous example

>>> response

{'status': '200',

'content-length': '3070',

'content-location': 'http://diveintopython3.org/examples/feed.xml',

'accept-ranges': 'bytes',

'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',

'vary': 'Accept-Encoding',

'server': 'Apache',

'last-modified': 'Wed, 03 Jun 2009 02:20:15 GMT',

'connection': 'close',

'-content-encoding': 'gzip',

'etag': '"bfe-4cbbf5c0"',

'cache-control': 'max-age=86400',

'date': 'Wed, 03 Jun 2009 02:21:41 GMT',

'content-type': 'application/xml'}

1. The response you get back from this single call to the request() method is the response from the final URL.

2. httplib2 adds the final URL to the response dictionary, as content-location. This is not a header that

came from the server; it’s specific to httplib2.

3. Apropos of nothing, this feed is compressed.

4. And cacheable. (This is important, as you’ll see in a minute.)

The response you get back gives you information about the final URL. What if you want more information

about the intermediate URLs, the ones that eventually redirected to the final URL? httplib2 lets you do

that, too.


# continued from the previous example

>>> response.previous

{'status': '302',

'content-length': '228',

'content-location': 'http://diveintopython3.org/examples/feed-302.xml',

'expires': 'Thu, 04 Jun 2009 02:21:41 GMT',

'server': 'Apache',

'connection': 'close',

'location': 'http://diveintopython3.org/examples/feed.xml',

'cache-control': 'max-age=86400',

'date': 'Wed, 03 Jun 2009 02:21:41 GMT',

'content-type': 'text/html; charset=iso-8859-1'}

>>> type(response)

<class 'httplib2.Response'>

>>> type(response.previous)

<class 'httplib2.Response'>

>>> response.previous.previous

>>>

1. The response.previous attribute holds a reference to the previous response object that httplib2 followed

to get to the current response object.

2. Both response and response.previous are httplib2.Response objects.

3. That means you can check response.previous.previous to follow the redirect chain backwards even

further. (Scenario: one URL redirects to a second URL which redirects to a third URL. It could happen!) In

this case, we’ve already reached the beginning of the redirect chain, so the attribute is None.
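Here is a sketch that collects the whole chain into a list of URLs, in request order. redirect_chain is my own name; it relies on the content-location entry that httplib2 adds to each response, as seen above.

def redirect_chain(response):
    '''Walk the response.previous links back to the original request
    and return the URLs visited, oldest first.'''
    urls = []
    while response is not None:
        urls.append(response['content-location'])
        response = response.previous
    return list(reversed(urls))

# For the example above, this would return
# ['http://diveintopython3.org/examples/feed-302.xml',
#  'http://diveintopython3.org/examples/feed.xml']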

What happens if you request the same URL again?


# continued from the previous example

>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-302.xml')

connect: (diveintopython3.org, 80)

send: b'GET /examples/feed-302.xml HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 302 Found'

>>> content2 == content

True

1. Same URL, same httplib2.Http object (and therefore the same cache).

2. The 302 response was not cached, so httplib2 sends another request for the same URL.

3. Once again, the server responds with a 302. But notice what didn’t happen: there wasn’t ever a second

request for the final URL, http://diveintopython3.org/examples/feed.xml. That response was cached

(remember the Cache-Control header that you saw in the previous example). Once httplib2 received the

302 Found code, it checked its cache before issuing another request. The cache contained a fresh copy of

http://diveintopython3.org/examples/feed.xml, so there was no need to re-request it.

4. By the time the request() method returns, it has read the feed data from the cache and returned it. Of

course, it’s the same as the data you received last time.

In other words, you don’t have to do anything special for temporary redirects. httplib2 will follow them

automatically, and the fact that one URL redirects to another has no bearing on httplib2’s support for

compression, caching, ETags, or any of the other features of HTTP.

Permanent redirects are just as simple.


# continued from the previous example

>>> response, content = h.request('http://diveintopython3.org/examples/feed-301.xml')

connect: (diveintopython3.org, 80)

send: b'GET /examples/feed-301.xml HTTP/1.1

Host: diveintopython3.org

accept-encoding: deflate, gzip

user-agent: Python-httplib2/$Rev: 259 $'

reply: 'HTTP/1.1 301 Moved Permanently'

>>> response.fromcache

True

1. Once again, this URL doesn’t really exist. I’ve set up my server to issue a permanent redirect to

http://diveintopython3.org/examples/feed.xml.

2. And here it is: status code 301. But again, notice what didn’t happen: there was no request to the redirect URL. Why not? Because it’s already cached locally.

3. httplib2 “followed” the redirect right into its cache.

But wait! There’s more!

# continued from the previous example

>>> response2, content2 = h.request('http://diveintopython3.org/examples/feed-301.xml')

>>> response2.fromcache

True

>>> content2 == content

True

1. Here’s the difference between temporary and permanent redirects: once httplib2 follows a permanent

redirect, all further requests for that URL will transparently be rewritten to the target URL without hitting the

network for the original URL. Remember, debugging is still turned on, yet there is no output of network

activity whatsoever.

2. Yep, this response was retrieved from the local cache.

3. Yep, you got the entire feed (from the cache).

HTTP. It works.


14.6. BEYOND HTTP GET

HTTP web services are not limited to GET requests. What if you want to create something new? Whenever you post a comment on a discussion forum, update your weblog, or publish your status on a microblogging service like Twitter or Identi.ca, you’re probably already using HTTP POST.

Twitter and Identi.ca both offer a simple HTTP-based API for publishing and updating your status in 140 characters or less. Let’s look at Identi.ca’s API documentation for updating your status:

Identi.ca REST API Method: statuses/update
Updates the authenticating user’s status. Requires the status parameter specified below. Request must be a POST.

URL: https://identi.ca/api/statuses/update.format
Formats: xml, json, rss, atom
HTTP Method(s): POST
Requires Authentication: true
Parameters: status. Required. The text of your status update. URL-encode as necessary.

How does this work? To publish a new message on Identi.ca, you need to issue an HTTP POST request to https://identi.ca/api/statuses/update.format. (The format bit is not part of the URL; you replace it with the data format you want the server to return in response to your request. So if you want a response in XML, you would post the request to https://identi.ca/api/statuses/update.xml.) The request needs to include a parameter called status, which contains the text of your status update. And the request needs to be authenticated.

Authenticated? Sure. To update your status on Identi.ca, you need to prove who you are. Identi.ca is not a

wiki; only you can update your own status. Identi.ca uses HTTP Basic Authentication (a.k.a. RFC 2617) over SSL to provide secure but easy-to-use authentication. httplib2 supports both SSL and HTTP Basic Authentication, so this part is easy.

A POST request is different from a GET request, because it includes a payload. The payload is the data you

want to send to the server. The one piece of data that this API method requires is status, and it should be URL-encoded. This is a very simple serialization format that takes a set of key-value pairs (i.e. a dictionary) and transforms it into a string.

>>> from urllib.parse import urlencode

>>> data = {'status': 'Test update from Python 3'}

>>> urlencode(data)

'status=Test+update+from+Python+3'

1. Python comes with a utility function to URL-encode a dictionary: urllib.parse.urlencode().

2. This is the sort of dictionary that the Identi.ca API is looking for. It contains one key, status, whose value is

the text of a single status update.

3. This is what the URL-encoded string looks like. This is the payload that will be sent “on the wire” to the

Identi.ca API server in your HTTP POST request.
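Incidentally, this serialization is reversible. The standard library can decode the same string back into a dictionary; note that each value comes back as a list, since a key may legitimately appear more than once:

>>> from urllib.parse import parse_qs
>>> parse_qs('status=Test+update+from+Python+3')
{'status': ['Test update from Python 3']}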


>>> from urllib.parse import urlencode

>>> import httplib2

>>> httplib2.debuglevel = 1

>>> h = httplib2.Http('.cache')

>>> data = {'status': 'Test update from Python 3'}

>>> h.add_credentials('diveintomark', 'MY_SECRET_PASSWORD', 'identi.ca')

>>> resp, content = h.request('https://identi.ca/api/statuses/update.xml',

...

'POST',

...

urlencode(data),

...

headers={'Content-Type': 'application/x-www-form-urlencoded'})

1. This is how httplib2 handles authentication. Store your username and password with the add_credentials() method. When httplib2 tries to issue the request, the server will respond with a 401 Unauthorized status code, and it will list which authentication methods it supports (in the WWW-Authenticate header). httplib2 will automatically construct an Authorization header and re-request the URL.

2. The second parameter is the type of HTTP request, in this case POST.

3. The third parameter is the payload to send to the server. We’re sending the URL-encoded dictionary with a

status message.

4. Finally, we need to tell the server that the payload is URL-encoded data.

☞ The third parameter to the add_credentials() method is the domain in which the

credentials are valid. You should always specify this! If you leave out the domain and

later reuse the httplib2.Http object on a different authenticated site, httplib2

might end up leaking one site’s username and password to the other site.

This is what goes over the wire:


# continued from the previous example

send: b'POST /api/statuses/update.xml HTTP/1.1

Host: identi.ca

Accept-Encoding: identity

Content-Length: 32

content-type: application/x-www-form-urlencoded

user-agent: Python-httplib2/$Rev: 259 $

status=Test+update+from+Python+3'

reply: 'HTTP/1.1 401 Unauthorized'

send: b'POST /api/statuses/update.xml HTTP/1.1

Host: identi.ca

Accept-Encoding: identity

Content-Length: 32

content-type: application/x-www-form-urlencoded

authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2

user-agent: Python-httplib2/$Rev: 259 $

status=Test+update+from+Python+3'

reply: 'HTTP/1.1 200 OK'

1. After the first request, the server responds with a 401 Unauthorized status code. httplib2 will never send

authentication headers unless the server explicitly asks for them. This is how the server asks for them.

2. httplib2 immediately turns around and requests the same URL a second time.

3. This time, it includes the username and password that you added with the add_credentials() method.

4. It worked!
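Putting the pieces together, here is a sketch of the whole POST as a reusable function. update_status is my own name, not part of httplib2 or the Identi.ca API.

from urllib.parse import urlencode
import httplib2

def update_status(username, password, text):
    '''Publish a status update on Identi.ca; returns (response, content).'''
    h = httplib2.Http('.cache')
    h.add_credentials(username, password, 'identi.ca')
    return h.request('https://identi.ca/api/statuses/update.xml',
                     'POST',
                     urlencode({'status': text}),
                     headers={'Content-Type': 'application/x-www-form-urlencoded'})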

What does the server send back after a successful request? That depends entirely on the web service API. In

some protocols (like the Atom Publishing Protocol), the server sends back a 201 Created status code and the location of the newly created resource in the Location header. Identi.ca sends back a 200 OK and an

XML document containing information about the newly created resource.


# continued from the previous example

>>> print(content.decode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>

<status>

<text>Test update from Python 3</text>

<truncated>false</truncated>

<created_at>Wed Jun 10 03:53:46 +0000 2009</created_at>

<in_reply_to_status_id></in_reply_to_status_id>

<source>api</source>

<id>5131472</id>

<in_reply_to_user_id></in_reply_to_user_id>

<in_reply_to_screen_name></in_reply_to_screen_name>

<favorited>false</favorited>

<user>

<id>3212</id>

<name>Mark Pilgrim</name>

<screen_name>diveintomark</screen_name>

<location>27502, US</location>

<description>tech writer, husband, father</description>

<profile_image_url>http://avatar.identi.ca/3212-48-20081216000626.png</profile_image_url>

<url>http://diveintomark.org/</url>

<protected>false</protected>

<followers_count>329</followers_count>

<profile_background_color></profile_background_color>

<profile_text_color></profile_text_color>

<profile_link_color></profile_link_color>

<profile_sidebar_fill_color></profile_sidebar_fill_color>

<profile_sidebar_border_color></profile_sidebar_border_color>

<friends_count>2</friends_count>

<created_at>Wed Jul 02 22:03:58 +0000 2008</created_at>

<favourites_count>30768</favourites_count>

<utc_offset>0</utc_offset>

<time_zone>UTC</time_zone>

<profile_background_image_url></profile_background_image_url>


<profile_background_tile>false</profile_background_tile>

<statuses_count>122</statuses_count>

<following>false</following>

<notifications>false</notifications>

</user>

</status>

1. Remember, the data returned by httplib2 is always bytes, not a string. To convert it to a string, you need to decode it using the proper character encoding. Identi.ca’s API always returns results in UTF-8, so that

part is easy.

2. There’s the text of the status message we just published.

3. There’s the unique identifier for the new status message. Identi.ca uses this to construct a URL for viewing

the message on the web.

And here it is: [screenshot of the newly published status message on the Identi.ca web site]

14.7. BEYOND HTTP POST

H T T P isn’t limited to GET and POST. Those are certainly the most common types of requests, especially in

web browsers. But web service APIs can go beyond GET and POST, and httplib2 is ready.

# continued from the previous example

>>> from xml.etree import ElementTree as etree

>>> tree = etree.fromstring(content)

>>> status_id = tree.findtext('id')

>>> status_id

'5131472'

>>> url = 'https://identi.ca/api/statuses/destroy/{0}.xml'.format(status_id)

>>> resp, deleted_content = h.request(url, 'DELETE')

1. The server returned XML, right? You know how to parse XML.

2. The findtext() method finds the first instance of the given expression and extracts its text content. In this

case, we’re just looking for an <id> element.

3. Based on the text content of the <id> element, we can construct a URL to delete the status message we

just published.

4. To delete a message, you simply issue an HTTP DELETE request to that URL.

This is what goes over the wire:


send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1

Host: identi.ca

Accept-Encoding: identity

user-agent: Python-httplib2/$Rev: 259 $

'

reply: 'HTTP/1.1 401 Unauthorized'

send: b'DELETE /api/statuses/destroy/5131472.xml HTTP/1.1

Host: identi.ca

Accept-Encoding: identity

authorization: Basic SECRET_HASH_CONSTRUCTED_BY_HTTPLIB2

user-agent: Python-httplib2/$Rev: 259 $

'

reply: 'HTTP/1.1 200 OK'

>>> resp.status

200

1. “Delete this status message.”

2. “I’m sorry, Dave, I’m afraid I can’t do that.”

3. “Unauthorized‽ Hmmph. Delete this status message, please…

4. …and here’s my username and password.”

5. “Consider it done!”

And just like that, poof, it’s gone.
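Nothing in that exchange is specific to DELETE; the second argument to the request() method can name any HTTP method. A sketch with a hypothetical URL and payload:

# PUT follows the same pattern: method name, then an optional payload.
resp, content = h.request('https://api.example.com/resource/42',    # hypothetical
                          'PUT',
                          'new representation of the resource',
                          headers={'content-type': 'text/plain'})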



14.8. FURTHER READING

httplib2:

httplib2 project page

More httplib2 code examples

Doing HTTP Caching Right: Introducing httplib2

httplib2: HTTP Persistence and Authentication

HTTP caching:
HTTP Caching Tutorial by Mark Nottingham
How to control caching with HTTP headers on Google Doctype
RFCs:

RFC 2616: HTTP

RFC 2617: HTTP Basic Authentication

RFC 1951: deflate compression

RFC 1952: gzip compression


CHAPTER 15. CASE STUDY: PORTING chardet TO

PYTHON 3

Words, words. They’re all we have to go on.

Rosencrantz and Guildenstern are Dead

15.1. DIVING IN

Question: what’s the #1 cause of gibberish text on the web, in your inbox, and across every computer

system ever written? It’s character encoding. In the Strings chapter, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a

gibberish character on a web page again, because all authoring systems stored accurate encoding information,

all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity

when converting between encodings.

I’d also like a pony.

A Unicode pony.

A Unipony, as it were.

I’ll settle for character encoding auto-detection.


15.2. WHAT IS CHARACTER ENCODING AUTO-DETECTION?

It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the

encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.

15.2.1. ISN’T THAT IMPOSSIBLE?

In general, yes. However, some encodings are optimized for specific languages, and languages are not

random. Some character sequences pop up all the time, while other sequences make no sense. A person

fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that

that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a

computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.

In other words, encoding detection is really language detection, combined with knowledge of which languages

tend to use which character encodings.

15.2.2. DOES SUCH AN ALGORITHM EXIST?

As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of

pages that have no encoding information whatsoever. Mozilla Firefox contains an encoding auto-detection

library which is open source. I ported the library to Python 2 and dubbed it the chardet module. This chapter will take you step-by-step through the process of porting the chardet module from Python 2 to

Python 3.

15.3. INTRODUCING THE chardet MODULE

Before we set off porting the code, it would help if you understood how the code worked! This is a brief

guide to navigating the code itself. The chardet library is too large to include inline here, but you can

download it from chardet.feedparser.org.
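Before diving into the internals, here is what using the module looks like from the outside. This is a minimal sketch of the Python 2 API described in this chapter; the file name is hypothetical.

import chardet

raw_bytes = open('unknown.txt', 'rb').read()
print(chardet.detect(raw_bytes))
# e.g. {'encoding': 'EUC-JP', 'confidence': 0.99}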


The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)

“Encoding detection is really language detection in drag.”

There are 5 categories of encodings that UniversalDetector handles:

1. UTF-N with a Byte Order Mark (BOM). This includes UTF-8, both Big-Endian and Little-Endian variants of UTF-16, and all 4 byte-order variants of UTF-32.
2. Escaped encodings, which are entirely 7-bit ASCII compatible, where non-ASCII characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese).
3. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: BIG5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a BOM.
4. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), WINDOWS-1255 (Hebrew), and TIS-620 (Thai).
5. WINDOWS-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
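You can also use UniversalDetector directly and feed it data incrementally instead of all at once. A sketch (the file name is hypothetical; feed() is the same method the rest of this chapter refers to):

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open('unknown.txt', 'rb') as f:
    for line in f:
        detector.feed(line)
        if detector.done:       # a prober has reached a confident answer
            break
detector.close()                # finalize the calculation
print(detector.result)          # same dictionary that chardet.detect() returns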

15.3.1. UTF-N WITH A BOM

If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing.


15.3.2. ESCAPED ENCODINGS

If the text contains a recognizable escape sequence that might indicate an escaped encoding,

UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text.

EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these

state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding,

EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the

caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other

state machines.

15.3.3. MULTI-BYTE ENCODINGS

Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it

creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort,

windows-1252.

The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell

that manages a group of other probers, one for each multi-byte encoding: BIG5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific

probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped

from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip

that prober). If a prober reports that it is reasonably confident that it has detected the encoding,

MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.

Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in

mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let

MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the

encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a

conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an

encoding-specific distribution analyzer.


The distribution analyzers (each defined in chardistribution.py) use language-specific models of which

characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution

analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total

number of characters, and a language-specific distribution ratio. If the confidence is high enough,

MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector,

which returns it to the caller.

The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to

distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses

2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in

jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of

Hiragana syllabary characters within the text. Once enough text has been processed, they return a