Encoding and Decoding Non-ASCII Text

Now, if we try to encode a non-ASCII string into raw bytes as ASCII, we’ll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates two bytes for each of these characters instead. If you write this string to a file, the raw bytes shown here are what is actually stored in the file for the encoding types given:

>>> S = '\u00c4\u00e8'
>>> S
'Äè'
>>> len(S)
2

>>> S.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1:
ordinal not in range(128)

>>> S.encode('latin-1')              # One byte per character
b'\xc4\xe8'

>>> S.encode('utf-8')                # Two bytes per character
b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1'))         # 2 bytes in latin-1, 4 in utf-8
2
>>> len(S.encode('utf-8'))
4

Note that you can also go the other way, reading raw bytes from a file and decoding them back to a Unicode string. However, as we’ll see later, the encoding name you pass to the open call causes this decoding to be done for you automatically on input (and avoids issues that may arise from splitting a multibyte character sequence when reading by blocks of bytes):

>>> B = b'\xc4\xe8'
>>> B
b'\xc4\xe8'
>>> len(B)                           # 2 raw bytes, 2 characters
2
>>> B.decode('latin-1')              # Decode latin-1 bytes to text
'Äè'

>>> B = b'\xc3\x84\xc3\xa8'
>>> len(B)                           # 4 raw bytes
4
>>> B.decode('utf-8')
'Äè'
>>> len(B.decode('utf-8'))           # 2 Unicode characters
2