confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to
MBCSGroupProber.
15.3.4. SINGLE-BYTE ENCODINGS
The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew).
SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
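The idea behind that calculation is easy to sketch. Here is a minimal, hypothetical illustration of 2-character-sequence scoring; the function name, the frequent_pairs argument, and the 0.95 ratio are invented for this example and are not the actual SingleByteCharSetProber code:

# Hypothetical sketch of 2-character-sequence scoring, not the real chardet code.
def sequence_confidence(text, frequent_pairs, typical_ratio=0.95):
    '''Score how "typical" a piece of text looks for one language model.

    frequent_pairs -- set of 2-character sequences that are common in the language
    typical_ratio  -- fraction of pairs expected to be frequent in real text
    '''
    total_pairs = 0
    frequent_hits = 0
    for a, b in zip(text, text[1:]):          # every adjacent 2-character sequence
        total_pairs += 1
        if a + b in frequent_pairs:
            frequent_hits += 1
    if not total_pairs:
        return 0.0
    # Confidence rises as the observed ratio approaches the language's typical ratio.
    return min(1.0, (frequent_hits / total_pairs) / typical_ratio)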
Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution
analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the
source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right
to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left
by the client). Because certain characters are encoded differently based on whether they appear in the
middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and
return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
15.3.5. windows-1252
If UniversalDetector detects a high-bit character in the text, but none of the other multi-byte or single-
byte encoding probers return a confident result, it creates a Latin1Prober (defined in latin1prober.py) to
try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because
English letters are encoded in the same way in many different encodings. The only way to distinguish
windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols,
and the like. Latin1Prober automatically reduces its confidence rating to allow more accurate probers to
win if at all possible.
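To picture the heuristic, here is a rough, hypothetical sketch; the byte values are the windows-1252 code points for common typographic punctuation, but the 50% threshold and the function itself are invented for illustration and are not Latin1Prober's actual algorithm:

# Illustrative only: a crude "smart punctuation" check, not the real Latin1Prober.
WINDOWS_1252_PUNCTUATION = {
    0x85,        # horizontal ellipsis
    0x91, 0x92,  # curly single quotes
    0x93, 0x94,  # curly double quotes
    0x96, 0x97,  # en dash, em dash
    0xA9,        # copyright sign
}

def looks_like_windows_1252(data):
    '''Return True if most high-bit bytes are typographic punctuation.'''
    high_bytes = [b for b in data if b >= 0x80]
    if not high_bytes:
        return False
    hits = sum(1 for b in high_bytes if b in WINDOWS_1252_PUNCTUATION)
    return hits / len(high_bytes) > 0.5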
⁂
15.4. RUNNING 2to3
We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility
script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it
can to Python 3. In some cases this is easy — a function was renamed or moved to a different
module — but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the
appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet
package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed
their magic.
The main chardet package is split across several different files, all in the same directory. The 2to3 script
makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3
will convert each of the files in turn.
C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w chardet\
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- chardet\__init__.py (original)
+++ chardet\__init__.py (refactored)
@@ -18,7 +18,7 @@
 __version__ = "1.0.1"
 
 def detect(aBuf):
-    import universaldetector
+    from . import universaldetector
     u = universaldetector.UniversalDetector()
     u.reset()
     u.feed(aBuf)
--- chardet\big5prober.py (original)
+++ chardet\big5prober.py (refactored)
@@ -25,10 +25,10 @@
 # 02110-1301 USA
 ######################### END LICENSE BLOCK #########################
 
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import Big5DistributionAnalysis
-from mbcssm import Big5SMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import Big5DistributionAnalysis
+from .mbcssm import Big5SMModel
 
 class Big5Prober(MultiByteCharSetProber):
     def __init__(self):
--- chardet\chardistribution.py (original)
+++ chardet\chardistribution.py (refactored)
@@ -25,12 +25,12 @@
 # 02110-1301 USA
 ######################### END LICENSE BLOCK #########################
 
-import constants
-from euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO
-from euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO
-from gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO
-from big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO
-from jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO
+from . import constants
+from .euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO
+from .euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO
+from .gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO
+from .big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO
+from .jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO
 
 ENOUGH_DATA_THRESHOLD = 1024
 SURE_YES = 0.99
.
.
. (it goes on like this for a while)
.
.
RefactoringTool: Files that were modified:
RefactoringTool: chardet\__init__.py
RefactoringTool: chardet\big5prober.py
RefactoringTool: chardet\chardistribution.py
RefactoringTool: chardet\charsetgroupprober.py
RefactoringTool: chardet\codingstatemachine.py
RefactoringTool: chardet\constants.py
RefactoringTool: chardet\escprober.py
RefactoringTool: chardet\escsm.py
RefactoringTool: chardet\eucjpprober.py
RefactoringTool: chardet\euckrprober.py
RefactoringTool: chardet\euctwprober.py
RefactoringTool: chardet\gb2312prober.py
RefactoringTool: chardet\hebrewprober.py
RefactoringTool: chardet\jpcntx.py
RefactoringTool: chardet\langbulgarianmodel.py
RefactoringTool: chardet\langcyrillicmodel.py
RefactoringTool: chardet\langgreekmodel.py
RefactoringTool: chardet\langhebrewmodel.py
RefactoringTool: chardet\langhungarianmodel.py
RefactoringTool: chardet\langthaimodel.py
RefactoringTool: chardet\latin1prober.py
RefactoringTool: chardet\mbcharsetprober.py
RefactoringTool: chardet\mbcsgroupprober.py
RefactoringTool: chardet\mbcssm.py
RefactoringTool: chardet\sbcharsetprober.py
RefactoringTool: chardet\sbcsgroupprober.py
RefactoringTool: chardet\sjisprober.py
RefactoringTool: chardet\universaldetector.py
RefactoringTool: chardet\utf8prober.py
Now run the 2to3 script on the testing harness, test.py.
C:\home\chardet> python c:\Python30\Tools\Scripts\2to3.py -w test.py
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- test.py (original)
+++ test.py (refactored)
@@ -4,7 +4,7 @@
 count = 0
 u = UniversalDetector()
 for f in glob.glob(sys.argv[1]):
-    print f.ljust(60),
+    print(f.ljust(60), end=' ')
     u.reset()
     for line in file(f, 'rb'):
         u.feed(line)
@@ -12,8 +12,8 @@
     u.close()
     result = u.result
     if result['encoding']:
-        print result['encoding'], 'with confidence', result['confidence']
+        print(result['encoding'], 'with confidence', result['confidence'])
     else:
-        print '******** no result'
+        print('******** no result')
     count += 1
-print count, 'tests'
+print(count, 'tests')
RefactoringTool: Files that were modified:
RefactoringTool: test.py
Well, that wasn’t so hard. Just a few imports and print statements to convert. Speaking of which, what was
the problem with all those import statements? To answer that, you need to understand how the chardet
module is split into multiple files.
⁂
15.5. A SHORT DIGRESSION INTO MULTI-FILE MODULES
chardet is a multi-file module. I could have chosen to put all the code in one file (named chardet.py), but I
didn’t. Instead, I made a directory (named chardet), then I made an __init__.py file in that directory. If
Python sees an __init__.py file in a directory, it assumes that all of the files in that directory are part of the same module. The module’s name is the name of the directory. Files within the directory can reference other files
within the same directory, or even within subdirectories. (More on that in a minute.) But the entire
collection of files is presented to other Python code as a single module — as if all the functions and classes
were in a single .py file.
What goes in the __init__.py file? Nothing. Everything. Something in between. The __init__.py file
doesn’t need to define anything; it can literally be an empty file. Or you can use it to define your main entry
point functions. Or you can put all your functions in it. Or all but one.
☞ A directory with an __init__.py file is always treated as a multi-file module.
Without an __init__.py file, a directory is just a directory of unrelated .py files.
Let’s see how that works in practice.
>>> import chardet
>>> dir(chardet)                                                             ①
['__builtins__', '__doc__', '__file__', '__name__',
 '__package__', '__path__', '__version__', 'detect']
>>> chardet                                                                  ②
<module 'chardet' from 'C:\Python31\lib\site-packages\chardet\__init__.py'>
1. Other than the usual class attributes, the only thing in the chardet module is a detect() function.
2. Here’s your first clue that the chardet module is more than just a file: the “module” is listed as the
__init__.py file within the chardet/ directory.
Let’s take a peek in that __init__.py file.
def detect(aBuf):                              ①
    from . import universaldetector            ②
    u = universaldetector.UniversalDetector()
    u.reset()
    u.feed(aBuf)
    u.close()
    return u.result
1. The __init__.py file defines the detect() function, which is the main entry point into the chardet library.
2. But the detect() function hardly has any code! In fact, all it really does is import the universaldetector
module and start using it. But where is universaldetector defined?
The answer lies in that odd-looking import statement:
from . import universaldetector
Translated into English, that means “import the universaldetector module; that’s in the same directory I
am,” where “I” is the chardet/__init__.py file. This is called a relative import. It’s a way for the files within
a multi-file module to reference each other, without worrying about naming conflicts with other modules you
may have installed in your import search path. This import statement will only look for the universaldetector module within the chardet/ directory itself.
These two concepts — __init__.py and relative imports — mean that you can break up your module into
as many pieces as you like. The chardet module comprises 36 .py files — 36! Yet all you need to do to
start using it is import chardet, then you can call the main chardet.detect() function. Unbeknownst to
your code, the detect() function is actually defined in the chardet/__init__.py file. Also unbeknownst to
you, the detect() function uses a relative import to reference a class defined in chardet/
universaldetector.py, which in turn uses relative imports on five other files, all contained in the chardet/
directory.
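If you have never built one, the scaffolding for a multi-file module is only a few lines. Here is a minimal, hypothetical example (the mylib name, the _engine helper, and the greet() function are all made up) that mirrors the chardet layout: a package directory, an __init__.py that defines the public entry point, and a relative import to a sibling file in the same directory:

# mylib/__init__.py  (hypothetical example package)
def greet(name):
    # Relative import: look for _engine.py inside the mylib/ directory itself.
    from . import _engine
    return _engine.build_greeting(name)

# mylib/_engine.py
def build_greeting(name):
    return 'Hello, {0}!'.format(name)

# Calling code elsewhere only ever sees the package:
#   import mylib
#   mylib.greet('world')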
☞ If you ever find yourself writing a large library in Python (or more likely, when you
realize that your small library has grown into a large one), take the time to refactor
it into a multi-file module. It’s one of the many things Python is good at, so take
advantage of it.
⁂
15.6. FIXING WHAT 2to3 CAN’T
15.6.1. False IS INVALID SYNTAX
Now for the real test: running the test harness against
the test suite. Since the test suite is designed to cover
all the possible code paths, it’s a good way to test our
ported code to make sure there aren’t any bugs lurking
anywhere.
C:\home\chardet> python test.py tests\*\*
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from chardet.universaldetector import UniversalDetector
  File "C:\home\chardet\chardet\universaldetector.py", line 51
    self.done = constants.False
                              ^
SyntaxError: invalid syntax
Hmm, a small snag. In Python 3, False is a reserved word, so you can’t use it as a variable name. Let’s look
at constants.py to see where it’s defined. Here’s the original version from constants.py, before the 2to3
script changed it:
import __builtin__
if not hasattr(__builtin__, 'False'):
    False = 0
    True = 1
else:
    False = __builtin__.False
    True = __builtin__.True
This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python
2.3, Python had no built-in bool type. This code detects the absence of the built-in constants True and
False, and defines them if necessary.
However, Python 3 will always have a bool type, so this entire code snippet is unnecessary. The simplest
solution is to replace all instances of constants.True and constants.False with True and False,
respectively, then delete this dead code from constants.py.
So this line in universaldetector.py:
self.done = constants.False
Becomes
self.done = False
Ah, wasn’t that satisfying? The code is shorter and more readable already.
15.6.2. NO MODULE NAMED constants
Time to run test.py again and see how far it gets.
C:\home\chardet> python test.py tests\*\*
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from chardet.universaldetector import UniversalDetector
  File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
    import constants, sys
ImportError: No module named constants
What’s that you say? No module named constants? Of course there’s a module named constants. It’s
right there, in chardet/constants.py.
Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative
imports — that is, modules that import other modules within the same library — but the logic behind relative imports has changed in Python 3. In Python 2, you could just import constants and it would look in the
chardet/ directory first. In Python 3, all import statements are absolute by default. If you want to do a relative import in Python 3, you need to be explicit about it:
from . import constants
But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular
import statement combines two different types of imports into one line: a relative import of the constants
module within the library, and an absolute import of the sys module that is pre-installed in the Python
standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t,
and the 2to3 script is not smart enough to split the import statement into two.
The solution is to split the import statement manually. So this two-in-one import:
import constants, sys
Needs to become two separate imports:
from . import constants
import sys
There are variations of this problem scattered throughout the chardet library. In some places it’s “import
constants, sys”; in other places, it’s “import constants, re”. The fix is the same: manually split the
import statement into two lines, one for the relative import, the other for the absolute import.
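For example, the "import constants, re" variant turns into this:

- import constants, re
+ from . import constants
+ import re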
Onward!
15.6.3. NAME 'file' IS NOT DEFINED
And here we go again, running test.py to try to
execute our test cases…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    for line in file(f, 'rb'):
NameError: name 'file' is not defined

This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global file() function was an alias for the open() function, which was the standard way of opening text files for reading. In Python 3, the global file() function no longer exists, but the open() function still exists.
Thus, the simplest solution to the problem of the missing file() is to call the open() function instead:
for line in open(f, 'rb'):
And that’s all I have to say about that.
15.6.4. CAN’T USE A STRING PATTERN ON A BYTES-LIKE OBJECT
Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 98, in feed
    if self._highBitDetector.search(aBuf):
TypeError: can't use a string pattern on a bytes-like object
To debug this, let’s see what self._highBitDetector is. It’s defined in the __init__ method of the
UniversalDetector class:
class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(r'[\x80-\xFF]')
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255
(0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is
designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem.
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted
Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in
Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters
(of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be
used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte
array. Looking at the traceback, this error occurred in universaldetector.py:
def feed(self, aBuf):
    .
    .
    .
    if self._mInputState == ePureAscii:
        if self._highBitDetector.search(aBuf):
And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that
calls it is the test harness, test.py.
u = UniversalDetector()
.
.
.
for line in open(f, 'rb'):
    u.feed(line)
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
What we need this regular expression to search is not an array of characters, but an array of bytes. Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
class UniversalDetector:
    def __init__(self):
-       self._highBitDetector = re.compile(r'[\x80-\xFF]')
-       self._escDetector = re.compile(r'(\033|~{)')
+       self._highBitDetector = re.compile(b'[\x80-\xFF]')
+       self._escDetector = re.compile(b'(\033|~{)')
        self._mEscCharSetProber = None
        self._mCharSetProbers = []
        self.reset()
Searching the entire codebase for other uses of the re module turns up two more instances, in
charsetprober.py. Again, the code is defining regular expressions as strings but executing them on aBuf,
which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
class CharSetProber:
    .
    .
    .
    def filter_high_bit_only(self, aBuf):
-       aBuf = re.sub(r'([\x00-\x7F])+', ' ', aBuf)
+       aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
        return aBuf

    def filter_without_english_letters(self, aBuf):
-       aBuf = re.sub(r'([A-Za-z])+', ' ', aBuf)
+       aBuf = re.sub(b'([A-Za-z])+', b' ', aBuf)
        return aBuf
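If you want to see the underlying rule for yourself, a quick interactive session makes it clear (the pattern and sample bytes here are arbitrary, and the exact wording of the error message varies slightly between Python 3 versions): a pattern compiled from a string can only search strings, and a pattern compiled from a byte array can only search byte arrays.

>>> import re
>>> re.compile(r'[\x80-\xFF]').search(b'abc\xe9')
Traceback (most recent call last):
  ...
TypeError: can't use a string pattern on a bytes-like object
>>> re.compile(b'[\x80-\xFF]').search(b'abc\xe9') is not None
True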
15.6.5. CAN'T CONVERT 'bytes' OBJECT TO str IMPLICITLY
Curiouser and curiouser…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
    elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
There’s an unfortunate clash of coding style and Python interpreter here. The TypeError could be anywhere
on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the
second, and the traceback would look the same. To narrow it down, you should split the line in half, like
this:
elif (self._mInputState == ePureAscii) and \
     self._escDetector.search(self._mLastChar + aBuf):
And re-run the test:
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly
Aha! The problem was not in the first conditional (self._mInputState == ePureAscii) but in the second
one. So what could cause a TypeError there? Perhaps you’re thinking that the search() method is
expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any
value; if you pass the right number of arguments, the function will execute. It may crash if you pass it a value
of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside
the function. But this traceback says it never got as far as calling the search() method. So the problem
must be in that + operation, as it’s trying to construct the value that it will eventually pass to the search()
method.
We know from previous debugging that aBuf is a byte array. So what is self._mLastChar? It’s an instance variable, defined in the reset() method, which is actually called from the __init__() method.
class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(b'[\x80-\xFF]')
        self._escDetector = re.compile(b'(\033|~{)')
        self._mEscCharSetProber = None
        self._mCharSetProbers = []
        self.reset()

    def reset(self):
        self.result = {'encoding': None, 'confidence': 0.0}
        self.done = False
        self._mStart = True
        self._mGotData = False
        self._mInputState = ePureAscii
        self._mLastChar = ''
And now we have our answer. Do you see it? self._mLastChar is a string, but aBuf is a byte array. And
you can’t concatenate a string to a byte array — not even a zero-length string.
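A one-line experiment confirms it (the values are arbitrary; the error text shown here matches Python 3.1, and later versions word it differently):

>>> '' + b'\xbf'
Traceback (most recent call last):
  ...
TypeError: Can't convert 'bytes' object to str implicitly
>>> b'' + b'\xbf'
b'\xbf'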
So what is self._mLastChar anyway? It gets set in the feed() method, just a few lines down from where the traceback occurred.
if self._mInputState == ePureAscii:
    if self._highBitDetector.search(aBuf):
        self._mInputState = eHighbyte
    elif (self._mInputState == ePureAscii) and \
         self._escDetector.search(self._mLastChar + aBuf):
        self._mInputState = eEscAscii

self._mLastChar = aBuf[-1]
The calling function calls this feed() method over and over again with a few bytes at a time. The method
processes the bytes it was given (passed in as aBuf), then stores the last byte in self._mLastChar in case
it’s needed during the next call. (In a multi-byte encoding, the feed() method might get called with half of a
character, then called again with the other half.) But because aBuf is now a byte array instead of a string,
self._mLastChar needs to be a byte array as well. Thus:
def reset(self):
    .
    .
    .
-   self._mLastChar = ''
+   self._mLastChar = b''
Searching the entire codebase for “mLastChar” turns up a similar problem in mbcharsetprober.py, but
instead of tracking the last character, it tracks the last two characters. The MultiByteCharSetProber class
uses a list of 1-character strings to track the last two characters. In Python 3, it needs to use a list of
integers, because it’s not really tracking characters, it’s tracking bytes. (Bytes are just integers from 0-255.)
class MultiByteCharSetProber(CharSetProber):
    def __init__(self):
        CharSetProber.__init__(self)
        self._mDistributionAnalyzer = None
        self._mCodingSM = None
-       self._mLastChar = ['\x00', '\x00']
+       self._mLastChar = [0, 0]

    def reset(self):
        CharSetProber.reset(self)
        if self._mCodingSM:
            self._mCodingSM.reset()
        if self._mDistributionAnalyzer:
            self._mDistributionAnalyzer.reset()
-       self._mLastChar = ['\x00', '\x00']
+       self._mLastChar = [0, 0]
15.6.6. UNSUPPORTED OPERAND TYPE(S) FOR +: 'int' AND 'bytes'
I have good news, and I have bad news. The good news is we’re making progress…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
    self._escDetector.search(self._mLastChar + aBuf):
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
…The bad news is it doesn’t always feel like progress.
But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error
than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t
try to concatenate an int with a byte array (bytes). In fact, you just spent a lot of time ensuring that
self._mLastChar was a byte array. How did it turn into an int?
The answer lies not in the previous lines of code, but in the following lines.
if self._mInputState == ePureAscii:
    if self._highBitDetector.search(aBuf):
        self._mInputState = eHighbyte
    elif (self._mInputState == ePureAscii) and \
         self._escDetector.search(self._mLastChar + aBuf):
        self._mInputState = eEscAscii

self._mLastChar = aBuf[-1]
This error doesn’t occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what’s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
>>> aBuf = b'\xEF\xBB\xBF'                        ①
>>> len(aBuf)
3
>>> mLastChar = aBuf[-1]
>>> mLastChar                                     ②
191
>>> type(mLastChar)                               ③
<class 'int'>
>>> mLastChar + aBuf                              ④
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
>>> mLastChar = aBuf[-1:]                         ⑤
>>> mLastChar
b'\xbf'
>>> mLastChar + aBuf                              ⑥
b'\xbf\xef\xbb\xbf'
1. Define a byte array of length 3.
2. The last element of the byte array is 191.
3. That’s an integer.
4. Concatenating an integer with a byte array doesn’t work. You’ve now replicated the error you just found in
universaldetector.py.
5. Ah, here’s the fix. Instead of taking the last element of the byte array, use list slicing to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end
of the byte array. Now mLastChar is a byte array of length 1.
6. Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
So, to ensure that the feed() method in universaldetector.py continues to work no matter how often
it’s called, you need to initialize self._mLastChar as a 0-length byte array, then make sure it stays a byte array.
         self._escDetector.search(self._mLastChar + aBuf):
        self._mInputState = eEscAscii

- self._mLastChar = aBuf[-1]
+ self._mLastChar = aBuf[-1:]
15.6.7. ord() EXPECTED STRING OF LENGTH 1, BUT int FOUND
Tired yet? You’re almost there…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 116, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
    st = prober.feed(aBuf)
  File "C:\home\chardet\chardet\utf8prober.py", line 53, in feed
    codingState = self._mCodingSM.next_state(c)
  File "C:\home\chardet\chardet\codingstatemachine.py", line 43, in next_state
    byteCls = self._mModel['classTable'][ord(c)]
TypeError: ord() expected string of length 1, but int found
OK, so c is an int, but the ord() function was expecting a 1-character string. Fair enough. Where is c
defined?
# codingstatemachine.py
def next_state(self, c):
    # for each byte we get its class
    # if it is first byte, we also get byte length
    byteCls = self._mModel['classTable'][ord(c)]
That’s no help; it’s just passed into the function. Let’s pop the stack.
# utf8prober.py
def feed(self, aBuf):
    for c in aBuf:
        codingState = self._mCodingSM.next_state(c)
Do you see it? In Python 2, aBuf was a string, so c was a 1-character string. (That’s what you get when you
iterate over a string — all the characters, one by one.) But now, aBuf is a byte array, so c is an int, not a
1-character string. In other words, there’s no need to call the ord() function because c is already an int!
Thus:
def next_state(self, c):
    # for each byte we get its class
    # if it is first byte, we also get byte length
-   byteCls = self._mModel['classTable'][ord(c)]
+   byteCls = self._mModel['classTable'][c]
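You can verify the iteration behavior directly in the interactive shell (the byte string here is an arbitrary example):

>>> for c in b'\xe4\xb8\xad':
...     print(c)
...
228
184
173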
Searching the entire codebase for instances of “ord(c)” uncovers similar problems in
sbcharsetprober.py…
# sbcharsetprober.py
def feed(self, aBuf):
    if not self._mModel['keepEnglishLetter']:
        aBuf = self.filter_without_english_letters(aBuf)
    aLen = len(aBuf)
    if not aLen:
        return self.get_state()
    for c in aBuf:
        order = self._mModel['charToOrderMap'][ord(c)]
…and latin1prober.py…
# latin1prober.py
def feed(self, aBuf):
    aBuf = self.filter_with_english_letters(aBuf)
    for c in aBuf:
        charClass = Latin1_CharToClass[ord(c)]
In both cases, c iterates over aBuf, which means it is an integer, not a 1-character string. The solution is the same: change ord(c) to just plain c.
# sbcharsetprober.py
def feed(self, aBuf):
    if not self._mModel['keepEnglishLetter']:
        aBuf = self.filter_without_english_letters(aBuf)
    aLen = len(aBuf)
    if not aLen:
        return self.get_state()
    for c in aBuf:
-       order = self._mModel['charToOrderMap'][ord(c)]
+       order = self._mModel['charToOrderMap'][c]

# latin1prober.py
def feed(self, aBuf):
    aBuf = self.filter_with_english_letters(aBuf)
    for c in aBuf:
-       charClass = Latin1_CharToClass[ord(c)]
+       charClass = Latin1_CharToClass[c]
15.6.8. UNORDERABLE TYPES: int() >= str()
Let’s go again.
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 116, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
    st = prober.feed(aBuf)
  File "C:\home\chardet\chardet\sjisprober.py", line 68, in feed
    self._mContextAnalyzer.feed(self._mLastChar[2 - charLen :], charLen)
  File "C:\home\chardet\chardet\jpcntx.py", line 145, in feed
    order, charLen = self.get_order(aBuf[i:i+2])
  File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
    if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
TypeError: unorderable types: int() >= str()
So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is
rearing its ugly head. Take a look at the code:
class SJISContextAnalysis(JapaneseContextAnalysis):
    def get_order(self, aStr):
        if not aStr: return -1, 1
        # find out current char's byte length
        if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
           ((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):
            charLen = 2
        else:
            charLen = 1
And where does aStr come from? Let’s pop the stack:
def feed(self, aBuf, aLen):
    .
    .
    .
    i = self._mNeedToSkipCharNum
    while i < aLen:
        order, charLen = self.get_order(aBuf[i:i+2])
Oh look, it’s our old friend, aBuf. As you might have guessed from every other issue we’ve encountered in
this chapter, aBuf is a byte array. Here, the feed() method isn’t just passing it on wholesale; it’s slicing it.
But as you saw earlier in this chapter, slicing a byte array returns a byte array, so the aStr parameter that gets passed to the get_order() method is still a byte array.
And what is this code trying to do with aStr? It’s taking the first element of the byte array and comparing it
to a string of length 1. In Python 2, that worked, because aStr and aBuf were strings, and aStr[0] would
be a string, and you can compare strings for inequality. But in Python 3, aStr and aBuf are byte arrays,
aStr[0] is an integer, and you can’t compare integers and strings for inequality without explicitly coercing
one of them.
In this case, there’s no need to make the code more complicated by adding an explicit coercion. aStr[0]
yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings
to integers. And while we’re at it, let’s change aStr to aBuf, since it’s not actually a string.
class SJISContextAnalysis(JapaneseContextAnalysis):
-   def get_order(self, aStr):
-       if not aStr: return -1, 1
+   def get_order(self, aBuf):
+       if not aBuf: return -1, 1
        # find out current char's byte length
-       if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
-          ((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):
+       if ((aBuf[0] >= 0x81) and (aBuf[0] <= 0x9F)) or \
+          ((aBuf[0] >= 0xE0) and (aBuf[0] <= 0xFC)):
            charLen = 2
        else:
            charLen = 1

        # return its order if it is hiragana
-       if len(aStr) > 1:
-           if (aStr[0] == '\202') and \
-              (aStr[1] >= '\x9F') and \
-              (aStr[1] <= '\xF1'):
-               return ord(aStr[1]) - 0x9F, charLen
+       if len(aBuf) > 1:
+           if (aBuf[0] == 202) and \
+              (aBuf[1] >= 0x9F) and \
+              (aBuf[1] <= 0xF1):
+               return aBuf[1] - 0x9F, charLen
        return -1, charLen

class EUCJPContextAnalysis(JapaneseContextAnalysis):
-   def get_order(self, aStr):
-       if not aStr: return -1, 1
+   def get_order(self, aBuf):
+       if not aBuf: return -1, 1
        # find out current char's byte length
-       if (aStr[0] == '\x8E') or \
-          ((aStr[0] >= '\xA1') and (aStr[0] <= '\xFE')):
+       if (aBuf[0] == 0x8E) or \
+          ((aBuf[0] >= 0xA1) and (aBuf[0] <= 0xFE)):
            charLen = 2
-       elif aStr[0] == '\x8F':
+       elif aBuf[0] == 0x8F:
            charLen = 3
        else:
            charLen = 1

        # return its order if it is hiragana
-       if len(aStr) > 1:
-           if (aStr[0] == '\xA4') and \
-              (aStr[1] >= '\xA1') and \
-              (aStr[1] <= '\xF3'):
-               return ord(aStr[1]) - 0xA1, charLen
+       if len(aBuf) > 1:
+           if (aBuf[0] == 0xA4) and \
+              (aBuf[1] >= 0xA1) and \
+              (aBuf[1] <= 0xF3):
+               return aBuf[1] - 0xA1, charLen
        return -1, charLen
Searching the entire codebase for occurrences of the ord() function uncovers the same problem in chardistribution.py (specifically, in the EUCTWDistributionAnalysis, EUCKRDistributionAnalysis, GB2312DistributionAnalysis, Big5DistributionAnalysis, SJISDistributionAnalysis, and EUCJPDistributionAnalysis classes). In each case, the fix is similar to the change we made to the
EUCJPContextAnalysis and SJISContextAnalysis classes in jpcntx.py.
15.6.9. GLOBAL NAME 'reduce' IS NOT DEFINED
Once more into the breach…
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    u.close()
  File "C:\home\chardet\chardet\universaldetector.py", line 141, in close
    proberConfidence = prober.get_confidence()
  File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
    total = reduce(operator.add, self._mFreqCounter)
NameError: global name 'reduce' is not defined
According to the official What’s New In Python 3.0 guide, the reduce() function has been moved out of the global namespace and into the functools module. Quoting the guide: “Use functools.reduce() if you
really need it; however, 99 percent of the time an explicit for loop is more readable.” You can read more
about the decision from Guido van Rossum’s weblog: The fate of reduce() in Python 3000.
def get_confidence(self):
    if self.get_state() == constants.eNotMe:
        return 0.01
    total = reduce(operator.add, self._mFreqCounter)
The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will
do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and
roundabout way of adding up all the items in a list and returning the result.
This monstrosity was so common that Python added a global sum() function.
def get_confidence(self):
    if self.get_state() == constants.eNotMe:
        return 0.01
-   total = reduce(operator.add, self._mFreqCounter)
+   total = sum(self._mFreqCounter)
Since you’re no longer using the operator module, you can remove that import from the top of the file as
well.
from .charsetprober import CharSetProber
from . import constants
- import operator
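If you want to convince yourself that the two forms are equivalent, a quick check in the interactive shell will do it (the list is an arbitrary example):

>>> import operator, functools
>>> counter = [3, 0, 7, 2]
>>> functools.reduce(operator.add, counter)
12
>>> sum(counter)
12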
I CAN HAZ TESTZ?
C:\home\chardet> python test.py tests\*\*
tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
tests\Big5\0804.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\blog.worren.net.xml Big5 with confidence 0.99
tests\Big5\carbonxiv.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\catshadow.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\coolloud.org.tw.xml Big5 with confidence 0.99
tests\Big5\digitalwall.com.xml Big5 with confidence 0.99
tests\Big5\ebao.us.xml Big5 with confidence 0.99
tests\Big5\fudesign.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\kafkatseng.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\ke207.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\leavesth.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\letterlego.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\linyijen.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\marilynwu.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\myblog.pchome.com.tw.xml Big5 with confidence 0.99
tests\Big5\oui-design.com.xml Big5 with confidence 0.99
tests\Big5\sanwenji.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\sinica.edu.tw.xml Big5 with confidence 0.99
tests\Big5\sylvia1976.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\tlkkuo.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\tw.blog.xubg.com.xml Big5 with confidence 0.99
tests\Big5\unoriginalblog.com.xml Big5 with confidence 0.99
tests\Big5\upsaid.com.xml Big5 with confidence 0.99
tests\Big5\willythecop.blogspot.com.xml Big5 with confidence 0.99
tests\Big5\ytc.blogspot.com.xml Big5 with confidence 0.99
tests\EUC-JP\aivy.co.jp.xml EUC-JP with confidence 0.99
tests\EUC-JP\akaname.main.jp.xml EUC-JP with confidence 0.99
tests\EUC-JP\arclamp.jp.xml EUC-JP with confidence 0.99
.
.
.
316 tests
Holy crap, it actually works! /me does a little dance
⁂
15.7. SUMMARY
What have we learned?
1. Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There’s no way
around it. It’s hard.
2. The automated 2to3 tool is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It’s an impressive piece of engineering, but in the end it’s just an intelligent
search-and-replace bot.
3. The #1 porting problem in this library was the difference between strings and bytes. In this case that seems
obvious, since the whole point of the chardet library is to convert a stream of bytes into a string. But “a
stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You’ll get a
stream of bytes. Fetching a web page? Calling a web API? They return a stream of bytes, too.
4. You need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least,
you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
5. Test cases are essential. Don’t port anything without them. The only reason I have any confidence that
chardet works in Python 3 is that I started with a test suite that exercised all major code paths. If you
don’t have any tests, write some tests before you start porting to Python 3. If you have a few tests, write
more. If you have a lot of tests, then the real fun can begin.
CHAPTER 16. PACKAGING PYTHON LIBRARIES
❝ You’ll find the shame is like the pain; you only feel it once. ❞
— Marquise de Merteuil, Dangerous Liaisons
16.1. DIVING IN
Real artists ship. Or so says Steve Jobs. Do you want to release a Python script, library, framework, or
application? Excellent. The world needs more Python code. Python 3 comes with a packaging framework
called Distutils. Distutils is many things: a build tool (for you), an installation tool (for your users), a package
metadata format (for search engines), and more. It integrates with the Python Package Index (“PyPI”), a central repository for open source Python libraries.
All of these facets of Distutils center around the setup script, traditionally called setup.py. In fact, you’ve
already seen several Distutils setup scripts in this book. You used Distutils to install httplib2 in HTTP Web
Services and again to install chardet in Case Study: Porting chardet to Python 3.
In this chapter, you’ll learn how the setup scripts for chardet and httplib2 work, and you’ll step through
the process of releasing your own Python software.
# chardet's setup.py
from distutils.core import setup
setup(
    name = "chardet",
    packages = ["chardet"],
    version = "1.0.2",
    description = "Universal encoding detector",
    author = "Mark Pilgrim",
    author_email = "mark@diveintomark.org",
    url = "http://chardet.feedparser.org/",
    download_url = "http://chardet.feedparser.org/download/python3-chardet-1.0.1.tgz",
    keywords = ["encoding", "i18n", "xml"],
    classifiers = [
        "Programming Language :: Python",
        "Programming Language :: Python :: 3",
        "Development Status :: 4 - Beta",
        "Environment :: Other Environment",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
        "Operating System :: OS Independent",
        "Topic :: Software Development :: Libraries :: Python Modules",
        "Topic :: Text Processing :: Linguistic",
    ],
    long_description = """\
Universal character encoding detector
-------------------------------------
Detects
- ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
- Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
- EUC-KR, ISO-2022-KR (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
- ISO-8859-2, windows-1250 (Hungarian)
- ISO-8859-5, windows-1251 (Bulgarian)
- windows-1252 (English)
- ISO-8859-7, windows-1253 (Greek)
- ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)
This version requires Python 3 or later; a Python 2 version is available separately.
"""
)
☞ chardet and httplib2 are open source, but there’s no requirement that you release
your own Python libraries under any particular license. The process described in this
chapter will work for any Python software, regardless of license.
⁂
16.2. THINGS DISTUTILS CAN’T DO FOR YOU
Releasing your first Python package is a daunting process. (Releasing your second one is a little easier.)
Distutils tries to automate as much of it as possible, but there are some things you simply must do yourself.
• Choose a license. This is a complicated topic, fraught with politics and peril. If you wish to release your
software as open source, I humbly offer five pieces of advice:
1. Don’t write your own license.
2. Don’t write your own license.
3. Don’t write your own license.
4. It doesn’t need to be GPL, but it needs to be GPL-compatible.
5. Don’t write your own license.
• Classify your software using the PyPI classification system. I’ll explain what this means later in this chapter.
• Write a “read me” file. Don’t skimp on this. At a minimum, it should give your users an overview of
what your software does and how to install it.
⁂
16.3. DIRECTORY STRUCTURE
To start packaging your Python software, you need to get your files and directories in order. The httplib2
directory looks like this:
httplib2/                                  ①
|
+--README.txt                              ②
|
+--setup.py                                ③
|
+--httplib2/                               ④
   |
   +--__init__.py
   |
   +--iri2uri.py
1. Make a root directory to hold everything. Give it the same name as your Python module.
2. To accommodate Windows users, your “read me” file should include a .txt extension, and it should use
Windows-style carriage returns. Just because you use a fancy text editor that runs from the command line
and includes its own macro language, that doesn’t mean you need to make life difficult for your users. (Your
users use Notepad. Sad but true.) Even if you’re on Linux or Mac OS X, your fancy text editor undoubtedly
has an option to save files with Windows-style carriage returns.
3. Your Distutils setup script should be named setup.py unless you have a good reason not to. You do not
have a good reason not to.
4. If your Python software is a single .py file, you should put it in the root directory along with your “read
me” file and your setup script. But httplib2 is not a single .py file; it’s a multi-file module. But that’s OK!
Just put the httplib2 directory in the root directory, so you have an __init__.py file within an httplib2/
directory within the httplib2/ root directory. That’s not a problem; in fact, it will simplify your packaging
process.
The chardet directory looks slightly different. Like httplib2, it’s a multi-file module, so there’s a chardet/
directory within the chardet/ root directory. In addition to the README.txt file, chardet has HTML-
formatted documentation in the docs/ directory. The docs/ directory contains several .html and .css files
and an images/ subdirectory, which contains several .png and .gif files. (This will be important later.) Also,
in keeping with the convention for (L)GPL-licensed software, it has a separate file called COPYING.txt which
contains the complete text of the LGPL.
chardet/
|
+--COPYING.txt
|
+--setup.py
|
+--README.txt
|
+--docs/
| |
| +--index.html
| |
| +--usage.html
| |
| +--images/ ...
|
+--chardet/
   |
   +--__init__.py
   |
   +--big5freq.py
   |
   +--...
⁂
16.4. WRITING YOUR SETUP SCRIPT
The Distutils setup script is a Python script. In theory, it can do anything Python can do. In practice, it
should do as little as possible, in as standard a way as possible. Setup scripts should be boring. The more
exotic your installation process is, the more exotic your bug reports will be.
The first line of every Distutils setup script is always the same:
from distutils.core import setup
This imports the setup() function, which is the main entry point into Distutils. 95% of all Distutils setup
scripts consist of a single call to setup() and nothing else. (I totally just made up that statistic, but if your
Distutils setup script is doing more than calling the Distutils setup() function, you should have a good
reason. Do you have a good reason? I didn’t think so.)
The setup() function can take dozens of parameters. For the sanity of everyone involved, you must use
named arguments for every parameter. This is not merely a convention; it’s a hard requirement. Your setup script will crash if you try to call the setup() function with non-named arguments.
The following named arguments are required:
• name, the name of the package.
• version, the version number of the package.
• author, your full name.
• author_email, your email address.
• url, the home page of your project. This can be your PyPI package page if you don’t have a separate project website.
Although not required, I recommend that you also include the following in your setup script:
• description, a one-line summary of the project.
• long_description, a multi-line string in reStructuredText format. PyPI converts this to HTML and displays it on your package page.
• classifiers, a list of specially-formatted strings described in the next section.
☞ Setup script metadata is defined in PEP 314.
Now let’s look at the chardet setup script. It has all of these required and recommended parameters, plus
one I haven’t mentioned yet: packages.
from distutils.core import setup
setup(
    name = 'chardet',
    packages = ['chardet'],
    version = '1.0.2',
    description = 'Universal encoding detector',
    author = 'Mark Pilgrim',
    ...
)
The packages parameter highlights an unfortunate vocabulary overlap in the distribution process. We’ve
been talking about the “package” as the thing you’re building (and potentially listing in The Python “Package”
Index). But that’s not what this packages parameter refers to. It refers to the fact that the chardet module
is a multi-file module, sometimes known as… a “package.” The packages parameter tells Distutils to include the chardet/ directory, its __init__.py file, and all the other .py files that constitute the chardet module.
That’s kind of important; all this happy talk about documentation and metadata is irrelevant if you forget to
include the actual code!
⁂
16.5. CLASSIFYING YOUR PACKAGE
The Python Package Index (“PyPI”) contains thousands of Python libraries. Proper classification metadata will
allow people to find yours more easily. PyPI lets you browse packages by classifier. You can even select multiple classifiers to narrow your search. Classifiers are not invisible metadata that you can just ignore!
To classify your software, pass a classifiers parameter to the Distutils setup() function. The
classifiers parameter is a list of strings. These strings are not freeform. All classifier strings should come
from this list on PyPI.
Classifiers are optional. You can write a Distutils setup script without any classifiers at all. Don’t do that.
You should always include at least these classifiers:
• Programming Language. In particular, you should include both "Programming Language :: Python" and
"Programming Language :: Python :: 3". If you do not include these, your package will not show up in
this list of Python 3-compatible libraries, which is linked from the sidebar of every single page of pypi.python.org.
• License. This is the absolute first thing I look for when I’m evaluating third-party libraries. Don’t make me hunt for this vital information. Don’t include more than one license classifier unless your software is explicitly
available under multiple licenses. (And don’t release software under multiple licenses unless you’re forced to
do so. And don’t force other people to do so. Licensing is enough of a headache; don’t make it worse.)
• Operating System. If your software only runs on Windows (or Mac OS X, or Linux), I want to know
sooner rather than later. If your software runs anywhere without any platform-specific code, use the
classifier "Operating System :: OS Independent". Multiple Operating System classifiers are only
necessary if your software requires specific support for each platform. (This is not common.)
I also recommend that you include the following classifiers:
• Development Status. Is your software beta quality? Alpha quality? Pre-alpha? Pick one. Be honest.
• Intended Audience. Who would download your software? The most common choices are Developers,
End Users/Desktop, Science/Research, and System Administrators.
• Framework. If your software is a plugin for a larger Python framework like Django or Zope, include the appropriate Framework classifier. If not, omit it.
• Topic. There are a large number of topics to choose from; choose all that apply.
16.5.1. EXAMPLES OF GOOD PACKAGE CLASSIFIERS
By way of example, here are the classifiers for Django, a production-ready, cross-platform, BSD-licensed web application framework that runs on your web server. (Django is not yet compatible with Python 3, so the
Programming Language :: Python :: 3 classifier is not listed.)
Programming Language :: Python
License :: OSI Approved :: BSD License
Operating System :: OS Independent
Development Status :: 5 - Production/Stable
Environment :: Web Environment
Framework :: Django
Intended Audience :: Developers
Topic :: Internet :: WWW/HTTP
Topic :: Internet :: WWW/HTTP :: Dynamic Content
Topic :: Internet :: WWW/HTTP :: WSGI
Topic :: Software Development :: Libraries :: Python Modules
Here are the classifiers for chardet, the character encoding detection library covered in Case Study: Porting
chardet to Python 3. chardet is beta quality, cross-platform, Python 3-compatible, LGPL-licensed, and intended for developers to integrate into their own products.
Programming Language :: Python
Programming Language :: Python :: 3
License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Operating System :: OS Independent
Development Status :: 4 - Beta
Environment :: Other Environment
Intended Audience :: Developers
Topic :: Text Processing :: Linguistic
Topic :: Software Development :: Libraries :: Python Modules
And here are the classifiers for httplib2, the library featured in the HTTP Web Services chapter. httplib2
is beta quality, cross-platform, MIT-licensed, and intended for Python developers.
Programming Language :: Python
Programming Language :: Python :: 3
License :: OSI Approved :: MIT License
Operating System :: OS Independent
Development Status :: 4 - Beta
Environment :: Web Environment
Intended Audience :: Developers
Topic :: Internet :: WWW/HTTP
Topic :: Software Development :: Libraries :: Python Modules
16.6. SPECIFYING ADDITIONAL FILES WITH A MANIFEST
By default, Distutils will include the following files in your release package:
• README.txt
• setup.py
• The .py files needed by the multi-file modules listed in the packages parameter
• The individual .py files listed in the py_modules parameter
That will cover all the files in the httplib2 project. But for the chardet project, we also want to include the COPYING.txt license file and the entire docs/ directory that contains images and HTML files. To tell
Distutils to include these additional files and directories when it builds the chardet release package, you
need a manifest file.
A manifest file is a text file called MANIFEST.in. Place it in the project’s root directory, next to README.txt
and setup.py. Manifest files are not Python scripts; they are text files that contain a series of “commands” in
a Distutils-defined format. Manifest commands allow you to include or exclude specific files and directories.
This is the entire manifest file for the chardet project:
include COPYING.txt                                        ①
recursive-include docs *.html *.css *.png *.gif            ②
1. The first line is self-explanatory: include the COPYING.txt file from the project’s root directory.
2. The second line is a bit more complicated. The recursive-include command takes a directory name and
one or more filenames. The filenames aren’t limited to specific files; they can include wildcards. This line
means “See that docs/ directory in the project’s root directory? Look in there (recursively) for .html, .css,
.png, and .gif files. I want all of them in my release package.”
All manifest commands preserve the directory structure that you set up in your project directory. That
recursive-include command is not going to put a bunch of .html and .png files in the root directory of
the release package. It’s going to maintain the existing docs/ directory structure, but only include those files
inside that directory that match the given wildcards. (I didn’t mention it earlier, but the chardet
documentation is actually written in XML and converted to HTML by a separate script. I don’t want to
include the XML files in the release package, just the HTML and the images.)
☞ Manifest files have their own unique format. See Specifying the files to distribute and
the manifest template commands for details.
To reiterate: you only need to create a manifest file if you want to include files that Distutils doesn’t include
by default. If you do need a manifest file, it should only include the files and directories that Distutils
wouldn’t otherwise find on its own.
16.7. CHECKING YOUR SETUP SCRIPT FOR ERRORS
There’s a lot to keep track of. Distutils comes with a built-in validation command that checks that all the
required metadata is present in your setup script. For example, if you forget to include the version
parameter, Distutils will remind you.
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
running check
warning: check: missing required meta-data: version
Once you include a version parameter (and all the other required bits of metadata), the check command
will look like this:
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py check
running check
⁂
16.8. CREATING A SOURCE DISTRIBUTION
Distutils supports building multiple types of release packages. At a minimum, you should build a “source
distribution” that contains your source code, your Distutils setup script, your “read me” file, and whatever
additional files you want to include. To build a source distribution, pass the sdist command to your Distutils setup script.
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py sdist
running sdist
running check
reading manifest template 'MANIFEST.in'
writing manifest file 'MANIFEST'
creating chardet-1.0.2
creating chardet-1.0.2\chardet
creating chardet-1.0.2\docs
creating chardet-1.0.2\docs\images
copying files to chardet-1.0.2...
copying COPYING -> chardet-1.0.2
copying README.txt -> chardet-1.0.2
copying setup.py -> chardet-1.0.2
copying chardet\__init__.py -> chardet-1.0.2\chardet
copying chardet\big5freq.py -> chardet-1.0.2\chardet
...
copying chardet\universaldetector.py -> chardet-1.0.2\chardet
copying chardet\utf8prober.py -> chardet-1.0.2\chardet
copying docs\faq.html -> chardet-1.0.2\docs
copying docs\history.html -> chardet-1.0.2\docs
copying docs\how-it-works.html -> chardet-1.0.2\docs
copying docs\index.html -> chardet-1.0.2\docs
copying docs\license.html -> chardet-1.0.2\docs
copying docs\supported-encodings.html -> chardet-1.0.2\docs
copying docs\usage.html -> chardet-1.0.2\docs
copying docs\images\caution.png -> chardet-1.0.2\docs\images
copying docs\images\important.png -> chardet-1.0.2\docs\images
copying docs\images\note.png -> chardet-1.0.2\docs\images
copying docs\images\permalink.gif -> chardet-1.0.2\docs\images
copying docs\images\tip.png -> chardet-1.0.2\docs\images
copying docs\images\warning.png -> chardet-1.0.2\docs\images
creating dist
creating 'dist\chardet-1.0.2.zip' and adding 'chardet-1.0.2' to it
adding 'chardet-1.0.2\COPYING'
adding 'chardet-1.0.2\PKG-INFO'
adding 'chardet-1.0.2\README.txt'
adding 'chardet-1.0.2\setup.py'
adding 'chardet-1.0.2\chardet\big5freq.py'
adding 'chardet-1.0.2\chardet\big5prober.py'
...
adding 'chardet-1.0.2\chardet\universaldetector.py'
adding 'chardet-1.0.2\chardet\utf8prober.py'
adding 'chardet-1.0.2\chardet\__init__.py'
adding 'chardet-1.0.2\docs\faq.html'
adding 'chardet-1.0.2\docs\history.html'
adding 'chardet-1.0.2\docs\how-it-works.html'
adding 'chardet-1.0.2\docs\index.html'
adding 'chardet-1.0.2\docs\license.html'
adding 'chardet-1.0.2\docs\supported-encodings.html'
adding 'chardet-1.0.2\docs\usage.html'
adding 'chardet-1.0.2\docs\images\caution.png'
adding 'chardet-1.0.2\docs\images\important.png'
adding 'chardet-1.0.2\docs\images\note.png'
adding 'chardet-1.0.2\docs\images\permalink.gif'
adding 'chardet-1.0.2\docs\images\tip.png'
adding 'chardet-1.0.2\docs\images\warning.png'
removing 'chardet-1.0.2' (and everything under it)
Several things to note here:
• Distutils noticed the manifest file (MANIFEST.in).
• Distutils successfully parsed the manifest file and added the additional files we wanted — COPYING.txt and
the HTML and image files in the docs/ directory.
• If you look in your project directory, you’ll see that Distutils created a dist/ directory. Within the dist/
directory is the .zip file that you can distribute.
c:\Users\pilgrim\chardet> dir dist
Volume in drive C has no label.
Volume Serial Number is DED5-B4F8
Directory of c:\Users\pilgrim\chardet\dist
07/30/2009 06:29 PM <DIR> .
07/30/2009 06:29 PM <DIR> ..
07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
1 File(s) 206,440 bytes
2 Dir(s) 61,424,635,904 bytes free
⁂
16.9. CREATING A GRAPHICAL INSTALLER
In my opinion, every Python library deserves a graphical installer for Windows users. It’s easy to make (even
if you don’t run Windows yourself), and Windows users appreciate it.
Distutils can create a graphical Windows installer for you, by passing the bdist_wininst command to your Distutils setup script.
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py bdist_wininst
running bdist_wininst
running build
running build_py
creating build
creating build\lib
creating build\lib\chardet
copying chardet\big5freq.py -> build\lib\chardet
copying chardet\big5prober.py -> build\lib\chardet
...
copying chardet\universaldetector.py -> build\lib\chardet
copying chardet\utf8prober.py -> build\lib\chardet
copying chardet\__init__.py -> build\lib\chardet
installing to build\bdist.win32\wininst
running install_lib
creating build\bdist.win32
creating build\bdist.win32\wininst
creating build\bdist.win32\wininst\PURELIB
creating build\bdist.win32\wininst\PURELIB\chardet
copying build\lib\chardet\big5freq.py -> build\bdist.win32\wininst\PURELIB\chardet
copying build\lib\chardet\big5prober.py -> build\bdist.win32\wininst\PURELIB\chardet
...
copying build\lib\chardet\universaldetector.py -> build\bdist.win32\wininst\PURELIB\chardet
copying build\lib\chardet\utf8prober.py -> build\bdist.win32\wininst\PURELIB\chardet
copying build\lib\chardet\__init__.py -> build\bdist.win32\wininst\PURELIB\chardet
running install_egg_info
Writing build\bdist.win32\wininst\PURELIB\chardet-1.0.2-py3.1.egg-info
creating 'c:\users\pilgrim\appdata\local\temp\tmp2f4h7e.zip' and adding '.' to it
adding 'PURELIB\chardet-1.0.2-py3.1.egg-info'
adding 'PURELIB\chardet\big5freq.py'
adding 'PURELIB\chardet\big5prober.py'
...
adding 'PURELIB\chardet\universaldetector.py'
adding 'PURELIB\chardet\utf8prober.py'
adding 'PURELIB\chardet\__init__.py'
removing 'build\bdist.win32\wininst' (and everything under it)
c:\Users\pilgrim\chardet> dir dist
Volume in drive C has no label.
Volume Serial Number is AADE-E29F
Directory of c:\Users\pilgrim\chardet\dist
07/30/2009 10:14 PM <DIR> .
07/30/2009 10:14 PM <DIR> ..
07/30/2009 10:14 PM 371,236 chardet-1.0.2.win32.exe
07/30/2009 06:29 PM 206,440 chardet-1.0.2.zip
2 File(s) 577,676 bytes
2 Dir(s) 61,424,070,656 bytes free
16.9.1. BUILDING INSTALLABLE PACKAGES FOR OTHER OPERATING SYSTEMS
Distutils can help you build installable packages for Linux users. In my opinion, this probably isn’t worth your time. If you want your software distributed for Linux, your time would be better spent working with
community members who specialize in packaging software for major Linux distributions.
For example, my chardet library is in the Debian GNU/Linux repositories (and therefore in the Ubuntu
repositories as well). I had nothing to do with this; the packages just showed up there one day. The Debian community has its own policies for packaging Python libraries, and the Debian python-chardet package is designed to follow those conventions. And since the package lives in Debian’s repositories, Debian users will
receive security updates and/or new versions, depending on the system-wide settings they’ve chosen for
managing their own computers.
The Linux packages that Distutils builds offer none of these advantages. Your time is better spent elsewhere.
⁂
16.10. ADDING YOUR SOFTWARE TO THE PYTHON PACKAGE INDEX
Uploading software to the Python Package Index is a three-step process.
1. Register yourself
2. Register your software
3. Upload the packages you created with setup.py sdist and setup.py bdist_*
To register yourself, go to the PyPI user registration page. Enter your desired username and password, provide a valid email address, and click the Register button. (If you have a PGP or GPG key, you can also
provide that. If you don’t have one or don’t know what that means, don’t worry about it.) Check your
email; within a few minutes, you should receive a message from PyPI with a validation link. Click the link to
complete the registration process.
Now you need to register your software with PyPI and upload it. You can do this all in one step.
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
①
running register
We need to know who you are, so please choose either:
1. use your existing login,
2. register as a new user,
3. have the server generate a new password for you (and email it to you), or
4. quit
Your selection [default 1]:
1
②
Username: MarkPilgrim
③
Password:
Registering chardet to http://pypi.python.org/pypi
④
Server response (200): OK
running sdist
⑤
... output trimmed for brevity ...
running bdist_wininst
⑥
... output trimmed for brevity ...
running upload
⑦
Submitting dist\chardet-1.0.2.zip to http://pypi.python.org/pypi
Server response (200): OK
Submitting dist\chardet-1.0.2.win32.exe to http://pypi.python.org/pypi
Server response (200): OK
I can store your PyPI login so future submissions will be faster.
(the login will be stored in c:\home\.pypirc)
Save your login (y/N)?n
⑧
1. When you release your project for the first time, Distutils will add your software to the Python Package
Index and give it its own URL. Every time after that, it will simply update the project metadata with any
changes you may have made in your setup.py parameters. Next, it builds a source distribution (sdist) and
a Windows installer (bdist_wininst), then uploads them to PyPI (upload).
2. Type 1 or just press ENTER to select “use your existing login.”
3. Enter the username and password you selected on the PyPI user registration page. Distutils will not echo your password; it will not even echo asterisks in place of characters. Just type your password and press
ENTER.
4. Distutils registers your package with the Python Package Index…
5. …builds your source distribution…
6. …builds your Windows installer…
7. …and uploads them both to the Python Package Index.
8. If you want to automate the process of releasing new versions, you need to save your PyPI credentials in a
local file. This is completely insecure and completely optional.
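If you do choose to save your login, Distutils writes it to a .pypirc file in your home directory. As a rough sketch, the stored file looks something like this (the exact section names vary between Python versions, and the values below are placeholders, kept in plain text, which is exactly why this is insecure):
[distutils]
index-servers =
    pypi

[pypi]
username:MarkPilgrim
password:your-password-here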
Congratulations, you now have your own page on the Python Package Index! The address is
http://pypi.python.org/pypi/NAME, where NAME is the string you passed in the name parameter in your
setup.py file.
If you want to release a new version, just update your setup.py with the new version number, then run the
same upload command again:
c:\Users\pilgrim\chardet> c:\python31\python.exe setup.py register sdist bdist_wininst upload
⁂
16.11. THE MANY POSSIBLE FUTURES OF PYTHON PACKAGING
Distutils is not the be-all and end-all of Python packaging, but as of this writing (August 2009), it’s the only
packaging framework that works in Python 3. There are a number of other frameworks for Python 2; some
focus on installation, others on testing and deployment. Some or all of these may end up being ported to
Python 3 in the future.
These frameworks focus on installation:
• Pip
These focus on testing and deployment:
• Paver
• Fabric
• py2exe
⁂
16.12. FURTHER READING
On Distutils:
• Distributing Python Modules with Distutils
• Core Distutils functionality lists all the possible arguments to the setup() function
• PEP 370: Per user site-packages directory
• PEP 370 and “environment stew”
On other packaging frameworks:
• The Python packaging ecosystem
• A few corrections to “On packaging”
• Python packaging: a few observations
• Nobody expects Python packaging!
CHAPTER 17. PORTING CODE TO PYTHON 3 WITH 2to3
❝ Life is pleasant. Death is peaceful. It’s the transition that’s troublesome. ❞
— Isaac Asimov (attributed)
17.1. DIVING IN
So much has changed between Python 2 and Python 3, there are vanishingly few programs that will run
unmodified under both. But don’t despair! To help with this transition, Python 3 comes with a utility script
called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to
Python 3. Case study: porting chardet to Python 3 describes how to run the 2to3 script, then shows some things it can’t fix automatically. This appendix documents what it can fix automatically.
17.2. print STATEMENT
In Python 2, print was a statement. Whatever you wanted to print simply followed the print keyword. In
Python 3, print() is a function. Whatever you want to print, pass it to print() like any other function.
Notes  Python 2                      Python 3
①      print                         print()
②      print 1                       print(1)
③      print 1, 2                    print(1, 2)
④      print 1, 2,                   print(1, 2, end=' ')
⑤      print >>sys.stderr, 1, 2, 3   print(1, 2, 3, file=sys.stderr)
1. To print a blank line, call print() without any arguments.
2. To print a single value, call print() with one argument.
3. To print two values separated by a space, call print() with two arguments.
4. This one is a little tricky. In Python 2, if you ended a print statement with a comma, it would print the
values separated by spaces, then print a trailing space, then stop without printing a carriage return.
(Technically, it’s a little more complicated than that. The print statement in Python 2 used a now-
deprecated attribute called softspace. Instead of printing a space, Python 2 would set
sys.stdout.softspace to 1. The space character wasn’t really printed until something else got printed on
the same line. If the next print statement printed a carriage return, sys.stdout.softspace would be set
to 0 and the space would never be printed. You probably never noticed the difference unless your
application was sensitive to the presence or absence of trailing whitespace in print-generated output.) In
Python 3, the way to do this is to pass end=' ' as a keyword argument to the print() function. The end
argument defaults to '\n' (a carriage return), so overriding it will suppress the carriage return after printing
the other arguments.
5. In Python 2, you could redirect the output to a pipe — like sys.stderr — by using the >>pipe_name
syntax. In Python 3, the way to do this is to pass the pipe in the file keyword argument. The file
argument defaults to sys.stdout (standard out), so overriding it will output to a different pipe instead.
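As a quick, self-contained illustration of the end and file keyword arguments (this snippet is mine, not from the chardet sources):
import sys

print()                               # a blank line
print(1, 2)                           # prints '1 2' followed by a newline
print(1, 2, end=' ')                  # prints '1 2 ' with no trailing newline
print('continues the same line')
print('error text', file=sys.stderr)  # writes to standard error instead of standard output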
17.3. UNICODE STRING LITERALS
Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings.
Notes  Python 2              Python 3
①      u'PapayaWhip'         'PapayaWhip'
②      ur'PapayaWhip\foo'    r'PapayaWhip\foo'
1. Unicode string literals are simply converted into string literals, which, in Python 3, are always Unicode.
2. Unicode raw strings (in which Python does not auto-escape backslashes) are converted to raw strings. In
Python 3, raw strings are always Unicode.
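A tiny illustration of my own (not from chardet): in Python 3 the u prefix is gone because every string literal is already Unicode, and raw strings simply suppress backslash escaping.
s = 'PapayaWhip'        # already a Unicode string in Python 3; no u prefix needed
r = r'PapayaWhip\foo'   # raw string: the backslash is a literal character, not an escape
print(len(r))           # 14: 'PapayaWhip' (10 characters) plus backslash, f, o, o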
17.4. unicode() GLOBAL FUNCTION
Python 2 had two global functions to coerce objects into strings: unicode() to coerce them into Unicode
strings, and str() to coerce them into non-Unicode strings. Python 3 has only one string type, Unicode
strings, so the str() function is all you need. (The unicode() function no longer exists.)
Notes  Python 2             Python 3
       unicode(anything)    str(anything)
17.5. long DATA TYPE
Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger
than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the
number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need
for special syntax to distinguish them.
Further reading: PEP 237: Unifying Long Integers and Integers.
Notes  Python 2               Python 3
①      x = 1000000000000L     x = 1000000000000
②      x = 0xFFFFFFFFFFFFL    x = 0xFFFFFFFFFFFF
③      long(x)                int(x)
④      type(x) is long        type(x) is int
⑤      isinstance(x, long)    isinstance(x, int)
1. Base 10 long integer literals become base 10 integer literals.
2. Base 16 long integer literals become base 16 integer literals.
3. In Python 3, the old long() function no longer exists, since longs don’t exist. To coerce a variable to an
integer, use the int() function.
4. To check whether a variable is an integer, get its type and compare it to int, not long.
5. You can also use the isinstance() function to check data types; again, use int, not long, to check for
integers.
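A short demonstration of the unified integer type (my own example, nothing chardet-specific):
x = 1000000000000          # no trailing L; Python 3 integers have arbitrary precision
print(type(x) is int)      # True; there is no separate long type
print(isinstance(x, int))  # True
print(x ** 2)              # 1000000000000000000000000, with no overflow and no special syntax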
17.6. <> COMPARISON
Python 2 supported <> as a synonym for !=, the not-equals comparison operator. Python 3 supports the !=
operator, but not <>.
Notes  Python 2           Python 3
①      if x <> y:         if x != y:
②      if x <> y <> z:    if x != y != z:
1. A simple comparison.
2. A more complex comparison between three values.
17.7. has_key() DICTIONARY METHOD
In Python 2, dictionaries had a has_key() method to test whether the dictionary had a certain key. In
Python 3, this method no longer exists. Instead, you need to use the in operator.
Notes  Python 2                                               Python 3
①      a_dictionary.has_key('PapayaWhip')                     'PapayaWhip' in a_dictionary
②      a_dictionary.has_key(x) or a_dictionary.has_key(y)    x in a_dictionary or y in a_dictionary
③      a_dictionary.has_key(x or y)                           (x or y) in a_dictionary
④      a_dictionary.has_key(x + y)                            (x + y) in a_dictionary
⑤      x + a_dictionary.has_key(y)                            x + (y in a_dictionary)
1. The simplest form.
2. The in operator takes precedence over the or operator, so there is no need for parentheses around x in
a_dictionary or around y in a_dictionary.
3. On the other hand, you do need parentheses around x or y here, for the same reason — in takes
precedence over or. (Note: this code is completely different from the previous line. Python interprets x or
y first, which results in either x (if x is true in a boolean context) or y. Then it takes that singular value and checks whether it is a key in a_dictionary.)
4. The + operator takes precedence over the in operator, so this form technically doesn’t need parentheses
around x + y, but 2to3 includes them anyway.
5. This form definitely needs parentheses around y in a_dictionary, since the + operator takes precedence
over the in operator.
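To make the precedence rules concrete, here is a small example of my own (the dictionary and keys are made up):
a_dictionary = {'PapayaWhip': 1, 'AliceBlue': 2}
x, y = 'PapayaWhip', 'MistyRose'

print('PapayaWhip' in a_dictionary)              # True
print(x in a_dictionary or y in a_dictionary)    # True; `in` binds tighter than `or`
print((x or y) in a_dictionary)                  # x is truthy, so this checks only x: True
print((x + y) in a_dictionary)                   # checks the key 'PapayaWhipMistyRose': False
print(1 + (y in a_dictionary))                   # bool is a subclass of int, so this is 1 + 0 == 1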
17.8. DICTIONARY METHODS THAT RETURN LISTS
In Python 2, many dictionary methods returned lists. The most frequently used methods were keys(),
items(), and values(). In Python 3, all of these methods return dynamic views. In some contexts, this is
not a problem. If the method’s return value is immediately passed to another function that iterates through
the entire sequence, it makes no difference whether the actual type is a list or a view. In other contexts, it
matters a great deal. If you were expecting a complete list with individually addressable elements, your code
will choke, because views do not support indexing.
Notes  Python 2                               Python 3
①      a_dictionary.keys()                    list(a_dictionary.keys())
②      a_dictionary.items()                   list(a_dictionary.items())
③      a_dictionary.iterkeys()                iter(a_dictionary.keys())
④      [i for i in a_dictionary.iterkeys()]   [i for i in a_dictionary.keys()]
⑤      min(a_dictionary.keys())               no change
1. 2to3 errs on the side of safety, converting the return value from keys() to a static list with the list()
function. This will always work, but it will be less efficient than using a view. You should examine the
converted code to see if a list is absolutely necessary, or if a view would do.
2. Another view-to-list conversion, with the items() method. 2to3 will do the same thing with the values()
method.
3. Python 3 does not support the iterkeys() method anymore. Use keys(), and if necessary, convert the
view to an iterator with the iter() function.
4. 2to3 recognizes when the iterkeys() method is used inside a list comprehension, and converts it to the
keys() method (without wrapping it in an extra call to iter()). This works because views are iterable.
5. 2to3 recognizes that the keys() method is immediately passed to a function which iterates through an
entire sequence, so there is no need to convert the return value to a list first. The min() function will
happily iterate through the view instead. This applies to min(), max(), sum(), list(), tuple(), set(),
sorted(), any(), and all().
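Here is a quick sketch of my own showing where the list-versus-view difference bites:
a_dictionary = {'a': 1, 'b': 2, 'c': 3}

keys = a_dictionary.keys()     # a view object in Python 3, not a list
print(min(keys))               # fine: min() just iterates over the view
# keys[0]                      # TypeError: views do not support indexing
first = list(keys)[0]          # materialize a real list when you need indexing
print(first)
print('a' in keys)             # views still support fast membership tests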
17.9. MODULES THAT HAVE BEEN RENAMED OR REORGANIZED
Several modules in the Python Standard Library have been renamed. Several other modules which are related
to each other have been combined or reorganized to make their association more logical.
17.9.1. http
In Python 3, several related HTTP modules have been combined into a single package, http.
Notes  Python 2                   Python 3
①      import httplib             import http.client
②      import Cookie              import http.cookies
③      import cookielib           import http.cookiejar
④      import BaseHTTPServer      import http.server
       import SimpleHTTPServer
       import CGIHTTPServer
1. The http.client module implements a low-level library that can request HTTP resources and interpret
HTTP responses.
2. The http.cookies module provides a Pythonic interface to browser cookies that are sent in a Set-Cookie:
HTTP header.
3. The http.cookiejar module manipulates the actual files on disk that popular web browsers use to store
cookies.
4. The http.server module provides a basic HTTP server.
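As a small, hedged example of the renamed client module in use (the host name is only an illustration):
import http.client

connection = http.client.HTTPConnection('diveintopython3.org')
connection.request('GET', '/')
response = connection.getresponse()
print(response.status, response.reason)   # e.g. 200 OK
body = response.read()                    # the body comes back as bytes in Python 3
connection.close()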
17.9.2. urllib
Python 2 had a rat’s nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have
all been refactored and combined in a single package, urllib.
Notes  Python 2                             Python 3
①      import urllib                        import urllib.request, urllib.parse, urllib.error
②      import urllib2                       import urllib.request, urllib.error
③      import urlparse                      import urllib.parse
④      import robotparser                   import urllib.robotparser
⑤      from urllib import FancyURLopener    from urllib.request import FancyURLopener
       from urllib import urlencode         from urllib.parse import urlencode
⑥      from urllib2 import Request          from urllib.request import Request
       from urllib2 import HTTPError        from urllib.error import HTTPError
1. The old urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and
splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions
have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these
functions so they use the new naming scheme.
2. The old urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your
urllib2 favorites — the build_opener() method, Request objects, and HTTPBasicAuthHandler and
friends — are still available.
3. The urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in
Python 2.
4. The urllib.robotparser module parses robots.txt files.
5. The FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new
urllib.request module. The urlencode() function has moved to urllib.parse.
6. The Request object is still available in urllib.request, but exceptions like HTTPError have been moved to
urllib.error.
Did I mention that 2to3 will rewrite your function calls too? For example, if your Python 2 code imports
the urllib module and calls urllib.urlopen() to fetch data, 2to3 will fix both the import statement and
the function call.
Python 2:
import urllib
print urllib.urlopen('http://diveintopython3.org/').read()

Python 3:
import urllib.request, urllib.parse, urllib.error
print(urllib.request.urlopen('http://diveintopython3.org/').read())
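And as a standalone Python 3 sketch of the converted call (the URL is only an example; note that read() now returns bytes rather than a string):
from urllib.request import urlopen

response = urlopen('http://diveintopython3.org/')
data = response.read()     # bytes in Python 3
response.close()
print(len(data))
print(data[:15])           # the first few bytes of the page, as a bytes object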
17.9.3. dbm
All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM,
you can import the appropriate module within the dbm package.
Notes  Python 2          Python 3
       import dbm        import dbm.ndbm
       import gdbm       import dbm.gnu
       import dbhash     import dbm.bsd
       import dumbdbm    import dbm.dumb
       import anydbm     import dbm
       import whichdb
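A minimal sketch of the consolidated package in use (the file name example_db is hypothetical; dbm.open picks whichever backend is available on your system):
import dbm

db = dbm.open('example_db', 'c')   # 'c' opens for read/write, creating the file if needed
db[b'greeting'] = b'hello'         # keys and values are stored as bytes
print(db[b'greeting'])             # b'hello'
db.close()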
17.9.4. xmlrpc
XML-RPC is a lightweight method of performing remote RPC calls over HTTP. The XML-RPC client library
and several XML-RPC server implementations are now combined in a single package, xmlrpc.
Notes  Python 2                     Python 3
       import xmlrpclib             import xmlrpc.client
       import DocXMLRPCServer       import xmlrpc.server
       import SimpleXMLRPCServer
17.9.5. OTHER MODULES
Notes
Python 2
Python 3
①
try:
import io
import cStringIO as StringIO