Page 815 | Learning Python, Mark Lutz

Encoding and Decoding Non-ASCII Text

Now, if we try to encode a non-ASCII string into raw bytes using ASCII, we’ll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates two bytes per character for these characters instead. If you write this string to a file, the raw bytes shown here are what is actually stored in the file for the encoding types given:

>>> S = '\u00c4\u00e8'
>>> S
'Äè'
>>> len(S)
2

>>> S.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1:
ordinal not in range(128)

>>> S.encode('latin-1')   # One byte per character
b'\xc4\xe8'

>>> S.encode('utf-8')   # Two bytes per character
b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1'))   # 2 bytes in latin-1, 4 in utf-8
2
>>> len(S.encode('utf-8'))
4

Note that you can also go the other way, reading raw bytes from a file and decoding them back to a Unicode string. However, as we’ll see later, the encoding name you pass to the open call causes this decoding to be done for you automatically on input (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):

>>> B = b'\xc4\xe8'
>>> B
b'\xc4\xe8'
>>> len(B)   # 2 raw bytes, 2 characters
2
>>> B.decode('latin-1')   # Decode Latin-1 bytes to Unicode text
'Äè'

>>> B = b'\xc3\x84\xc3\xa8'
>>> len(B)   # 4 raw bytes
4
>>> B.decode('utf-8')
'Äè'
>>> len(B.decode('utf-8'))   # 2 Unicode characters
2
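
The partial-sequence issue mentioned above can be illustrated by splitting the UTF-8 bytes in the middle of a character. This is a sketch under the assumption that the data arrives in two arbitrary blocks; it uses the standard library's codecs.getincrementaldecoder, which buffers an incomplete sequence until the remaining bytes arrive, and is not taken from the book:

```python
import codecs

data = b'\xc3\x84\xc3\xa8'            # UTF-8 bytes for 'Äè'
first, second = data[:1], data[1:]    # block boundary falls mid-character

# Decoding a block on its own fails when it ends in a partial sequence
try:
    first.decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds back the incomplete byte until the rest arrives
decoder = codecs.getincrementaldecoder('utf-8')()
text = decoder.decode(first)
text += decoder.decode(second)
text += decoder.decode(b'', final=True)   # flush; raises if bytes are left over

print(naive_ok)
print(text)
```

Opening the file in text mode with an encoding name spares you this bookkeeping, since the decoding is handled for you on input.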
