Reading and Writing Unicode in 3.0

We can convert a string to different encodings both manually, with method calls, and automatically, on file input and output. We’ll use the following Unicode string in this section to demonstrate:

C:\misc> c:\python30\python
>>> S = 'A\xc4B\xe8C'           # 5-character string, non-ASCII
>>> S
'AÄBèC'
>>> len(S)
5


Manual encoding

As we’ve already learned, we can always encode such a string to raw bytes according to the target encoding name:

# Encode manually with methods

>>> L = S.encode('latin-1')     # 5 bytes when encoded as latin-1
>>> L
b'A\xc4B\xe8C'
>>> len(L)
5

>>> U = S.encode('utf-8')       # 7 bytes when encoded as utf-8
>>> U
b'A\xc3\x84B\xc3\xa8C'
>>> len(U)
7
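As a minimal sketch of the same idea in script form, the following round-trips the string through each encoding and then decodes with a mismatched name. The variable names raw, back, and wrong are illustrative only and are not part of the original session:

```python
# Round-trip sketch: encode a str to bytes, then decode it back.
S = 'A\xc4B\xe8C'                       # same 5-character string as above

for name in ('latin-1', 'utf-8'):
    raw = S.encode(name)                # str -> bytes
    back = raw.decode(name)             # bytes -> str, using the same name
    print(name, len(raw), back == S)

# Decoding with a different name than was used to encode does not
# round-trip: here each UTF-8 byte is read as one latin-1 character.
wrong = S.encode('utf-8').decode('latin-1')
print(len(wrong), wrong == S)

# The reverse mismatch fails outright, because the latin-1 bytes are
# not a valid UTF-8 sequence.
try:
    S.encode('latin-1').decode('utf-8')
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)
```

The point is only that the encoding name used to decode must match the one used to encode; otherwise we get either different text or an exception.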

File output encoding

Now, to write our string to a text file in a particular encoding, we can simply pass the desired encoding name to open. We could encode manually first and write in binary mode, but there’s no need to:

# Encoding automatically when written

>>> open('latindata', 'w', encoding='latin-1').write(S)    # Write as latin-1
5
>>> open('utf8data', 'w', encoding='utf-8').write(S)       # Write as utf-8
5

>>> open('latindata', 'rb').read()                         # Read raw bytes
b'A\xc4B\xe8C'

>>> open('utf8data', 'rb').read()                          # Different in files
b'A\xc3\x84B\xc3\xa8C'
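To make the equivalence concrete, here is a small sketch that writes the string both ways and compares the bytes on disk. The scratch directory and the filenames utf8auto and utf8manual are assumptions made for this illustration, not names from the session above:

```python
import os
import tempfile

S = 'A\xc4B\xe8C'
tmpdir = tempfile.mkdtemp()                       # scratch directory (assumption)
auto_path = os.path.join(tmpdir, 'utf8auto')      # hypothetical filenames
manual_path = os.path.join(tmpdir, 'utf8manual')

# Text mode: the file object encodes the str as it is written
with open(auto_path, 'w', encoding='utf-8') as f:
    f.write(S)

# Binary mode: we encode by hand and write the resulting bytes
with open(manual_path, 'wb') as f:
    f.write(S.encode('utf-8'))

# Read both files back as raw bytes and compare
with open(auto_path, 'rb') as f:
    auto_bytes = f.read()
with open(manual_path, 'rb') as f:
    manual_bytes = f.read()

print(auto_bytes == manual_bytes)
```

Because this string contains no end-of-line characters, text-mode newline translation plays no part in the comparison.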

File input decoding

Similarly, to read arbitrary Unicode data, we simply pass the file’s encoding name to open, and it decodes from raw bytes to strings automatically. We could read raw bytes and decode manually too, but that can be tricky when reading in blocks (a block might end in the middle of a multibyte character), and it isn’t necessary:

# Decoding automatically when read

>>> open('latindata', 'r', encoding='latin-1').read()      # Decoded on input
'AÄBèC'
>>> open('utf8data', 'r', encoding='utf-8').read()         # Per encoding type
'AÄBèC'

>>> X = open('latindata', 'rb').read()                     # Manual decoding:
>>> X.decode('latin-1')                                    # Not necessary
'AÄBèC'
>>> X = open('utf8data', 'rb').read()
>>> X.decode()                                             # UTF-8 is default
'AÄBèC'
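The block-reading pitfall can be sketched as follows. This is one possible way to handle it by hand, using the standard library’s codecs.getincrementaldecoder; the in-memory io.BytesIO stream and the 2-byte block size are assumptions chosen so that a block boundary falls inside a multibyte character:

```python
import codecs
import io

S = 'A\xc4B\xe8C'
data = S.encode('utf-8')                    # b'A\xc3\x84B\xc3\xa8C'
stream = io.BytesIO(data)                   # stands in for a file opened in 'rb' mode

# Decoding a block on its own fails when it ends mid-character
try:
    data[:2].decode('utf-8')                # b'A\xc3' ends halfway through the second character
except UnicodeDecodeError as exc:
    print('naive block decode failed:', exc.reason)

# An incremental decoder holds incomplete trailing bytes until the next call
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
while True:
    block = stream.read(2)                  # deliberately tiny blocks
    if not block:
        break
    pieces.append(decoder.decode(block))
pieces.append(decoder.decode(b'', final=True))   # flush; raises if bytes are left over

text = ''.join(pieces)
print(text == S)
```

Text-mode files opened with an encoding argument take care of this buffering for us, which is why manual decoding is rarely worth the trouble.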

Decoding mismatches

Finally, keep in mind that this behavior of files in 3.0 limits the kind of content you can load as text. As suggested in the prior section, Python 3.0 really must be able to decode the data in text files into a str string, according to either the default or a passed-in Unicode encoding name. Trying to open a truly binary data file in text mode, for example, is unlikely to work in 3.0 even if you use the correct object types:

>>> file = open('python.exe', 'r')
>>> text = file.read()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2: ...

>>> file = open('python.exe', 'rb')
>>> data = file.read()
>>> data[:20]
b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00'

The first of these examples might not fail in Python 2.X (normal files do not decode text), even though it probably should: reading the file may return corrupted data in the string, due to automatic end-of-line translations in text mode (any embedded \r\n bytes will be translated to \n on Windows when read). To treat file content as Unicode text in 2.6, we need to use special tools instead of the general open built-in function, as we’ll see in a moment. First, though, let’s turn to a more explosive topic....