Coding Unicode Strings in Python 2.6
Now that I’ve shown you the basics of Unicode strings in 3.0, I need to explain that you can do much the same in 2.6, though the tools differ. unicode is available in Python 2.6, but it is a distinct data type from str, and it allows free mixing of normal and Unicode strings when they are compatible. In fact, you can essentially pretend 2.6’s str is 3.0’s bytes when it comes to decoding raw bytes into a Unicode string, as long as it’s in the proper form. Here is 2.6 in action; unicode characters display in hex in 2.6 unless you explicitly print, and non-ASCII displays can vary per shell (most of this section ran in IDLE):
C:\misc> c:\python26\python
>>> import sys
>>> sys.version
'2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'
>>> S = 'A\xC4B\xE8C' # String of 8-bit bytes
>>> print S # Some are non-ASCII
AÄBèC
>>> S.decode('latin-1') # Decode bytes per latin-1 to Unicode
u'A\xc4B\xe8C'
>>> S.decode('utf-8') # Not formatted as utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data
>>> S.decode('ascii') # Outside ASCII range
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
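The failures above suggest one common way to handle bytes of unknown origin: try a list of candidate encodings in order and keep the first that decodes cleanly. The following is only a minimal sketch; decode_first_match is a hypothetical helper of my own, not a library call, and the code is written in a version-neutral style (b'...' and u'...' literals) so that it is intended to run on both 2.6 and 3.3 or later:

```python
def decode_first_match(raw, encodings=('ascii', 'utf-8', 'latin-1')):
    """Return (text, encoding) for the first codec that decodes raw cleanly.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue                      # Try the next candidate
    raise ValueError('no candidate encoding could decode the data')

raw = b'A\xC4B\xE8C'                      # Same 8-bit bytes as S above
text, used = decode_first_match(raw)      # ascii and utf-8 fail, latin-1 works
```

Keep in mind that latin-1 maps every possible byte value to a character, so it never fails; a successful decode shows only that the bytes are legal in that encoding, not that the encoding is the one the data was written in.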
To store arbitrarily encoded Unicode text, make a unicode object with the u'xxx' literal form (this literal is no longer available in 3.0, since all strings support Unicode in 3.0):
>>> U = u'A\xC4B\xE8C' # Make Unicode string, hex escapes
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
Once you’ve created it, you can convert Unicode text to different raw byte encodings, similar to encoding str objects into bytes objects in 3.0:
>>> U.encode('latin-1') # Encode per latin-1: 8-bit bytes
'A\xc4B\xe8C'
>>> U.encode('utf-8') # Encode per utf-8: multibyte
'A\xc3\x84B\xc3\xa8C'
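To make the size difference between encodings concrete, here is a short sketch that encodes the same text three ways, checks that each result decodes back to the original, and records the byte counts. The variable names are mine, and the code is written to be version-neutral rather than tied to 2.6:

```python
text = u'A\xC4B\xE8C'                     # Five characters, two non-ASCII

sizes = {}
for name in ('latin-1', 'utf-8', 'utf-16'):
    data = text.encode(name)              # Unicode text to raw bytes
    assert data.decode(name) == text      # Round trip restores the text
    sizes[name] = len(data)

# latin-1 uses 1 byte per character; utf-8 uses 2 bytes for each of the
# two non-ASCII characters; utf-16 uses 2 bytes per character here plus
# a 2-byte byte order mark at the front
# sizes == {'latin-1': 5, 'utf-8': 7, 'utf-16': 12}

try:
    text.encode('ascii')                  # Non-ASCII characters cannot map
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```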
Non-ASCII characters can be coded with hex or Unicode escapes in string literals in 2.6, just as in 3.0. However, as with bytes in 3.0, the "\u..." and "\U..." escapes are recognized only for unicode strings in 2.6, not 8-bit str strings:
C:\misc> c:\python26\python
>>> U = u'A\xC4B\xE8C' # Hex escapes for non-ASCII
>>> U
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> U = u'A\u00C4B\U000000E8C' # Unicode escapes for non-ASCII
>>> U # \uXXXX = 16 bits, \UXXXXXXXX = 32 bits
u'A\xc4B\xe8C'
>>> print U
AÄBèC
>>> S = 'A\xC4B\xE8C' # Hex escapes work
>>> S
'A\xc4B\xe8C'
>>> print S # But some print oddly, unless decoded
A-BFC
>>> print S.decode('latin-1')
AÄBèC
>>> S = 'A\u00C4B\U000000E8C' # Not Unicode escapes: taken literally!
>>> S
'A\\u00C4B\\U000000E8C'
>>> print S
A\u00C4B\U000000E8C
>>> len(S)
19
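If you do end up with a byte string that contains literal backslash escapes like the 19-character S above, one way to recover the intended characters is the standard unicode_escape codec. This is a sketch under a stated assumption: the data is plain ASCII apart from the escape sequences, because the codec treats any other bytes as Latin-1:

```python
# Doubled backslashes keep the escapes as literal characters on any version
raw = b'A\\u00C4B\\U000000E8C'
raw_len = len(raw)                        # 19, as in the session above

text = raw.decode('unicode_escape')       # Interpret \u and \U sequences
text_len = len(text)                      # 5 characters after decoding
```

The result compares equal to u'A\xc4B\xe8C', the same string the u'...' literal produced earlier.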
Like 3.0’s str and bytes, 2.6’s unicode and str share nearly identical operation sets, so unless you need to convert to other encodings you can often treat unicode as though it were str. One of the primary differences between 2.6 and 3.0, though, is that unicode and non-Unicode str objects can be freely mixed in expressions: as long as the str contains only 7-bit ASCII characters, Python will automatically convert it up to unicode by decoding it with the default ASCII encoding (in 3.0, str and bytes never mix automatically and require manual conversions):
>>> u'ab' + 'cd' # Can mix if compatible in 2.6
u'abcd' # 'ab' + b'cd' not allowed in 3.0
In fact, the difference in types is often trivial to your code in 2.6. Like normal strings, Unicode strings may be concatenated, indexed, sliced, matched with the re module, and so on, and they cannot be changed in-place. If you ever need to convert between the two types explicitly, you can use the built-in str and unicode functions:
>>> str(u'spam') # Unicode to normal
'spam'
>>> unicode('spam') # Normal to Unicode
u'spam'
However, this liberal approach to mixing string types in 2.6 works only if the str can be decoded with the default encoding, which is normally ASCII; a str containing non-ASCII bytes raises an error when mixed with a unicode object:
>>> S = 'A\xC4B\xE8C' # Can't mix if incompatible
>>> U = u'A\xC4B\xE8C'
>>> S + U
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
>>> S.decode('latin-1') + U # Manual conversion still required
u'A\xc4B\xe8CA\xc4B\xe8C'
>>> print S.decode('latin-1') + U
AÄBèCAÄBèC
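One way to avoid depending on the implicit conversion at all is to normalize every value to Unicode text before combining, naming the encoding explicitly. The sketch below assumes the byte strings are latin-1 encoded; to_text is a hypothetical helper, and type(u'') is used only so that the same code is intended to run on both 2.6 (where it is unicode) and 3.3 or later (where it is str):

```python
text_type = type(u'')                     # unicode in 2.X, str in 3.X

def to_text(value, encoding='latin-1'):
    """Return value as Unicode text, decoding byte strings explicitly.

    Hypothetical helper; the latin-1 default is an assumption about the data.
    """
    if isinstance(value, text_type):
        return value                      # Already decoded text
    return value.decode(encoding)         # Raw bytes: decode per encoding

S = b'A\xC4B\xE8C'                        # 8-bit bytes, non-ASCII
U = u'A\xC4B\xE8C'                        # Unicode text
combined = to_text(S) + to_text(U)        # No implicit ASCII decoding
```

Because the decoding step names its encoding, this style fails the same way, or succeeds the same way, regardless of which Python line runs it.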
Finally, as we’ll see in more detail later in this chapter, 2.6’s open call supports only files of 8-bit bytes, returning their contents as str strings; it’s up to you to interpret the contents as text or binary data and decode if needed. To read and write Unicode files and encode or decode their content automatically, use 2.6’s codecs.open call, documented in the 2.6 library manual. This call provides much the same functionality as 3.0’s open and uses 2.6 unicode objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates strings to the desired encoding specified when the file is opened.
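As a brief preview, the following sketch writes Unicode text through codecs.open, inspects the raw bytes that landed in the file, and reads the text back. The temporary directory and the filename unidata.txt are arbitrary choices of mine, and explicit close calls are used to keep the code version-neutral:

```python
import codecs
import os
import tempfile

text = u'A\xC4B\xE8C'
path = os.path.join(tempfile.mkdtemp(), 'unidata.txt')   # Arbitrary name

out = codecs.open(path, 'w', encoding='utf-8')
out.write(text)                           # Encoded to utf-8 on the way out
out.close()

rawfile = open(path, 'rb')                # Plain open: undecoded bytes
raw = rawfile.read()                      # b'A\xc3\x84B\xc3\xa8C'
rawfile.close()

inp = codecs.open(path, 'r', encoding='utf-8')
decoded = inp.read()                      # Decoded back to Unicode text
inp.close()

os.remove(path)                           # Clean up the scratch file
```

The raw bytes match the result of U.encode('utf-8') shown earlier, which is the point: codecs.open simply applies the same encode and decode steps for you on each transfer.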