Coding ASCII Text

Let’s step through some examples that demonstrate text coding basics. As we’ve seen, ASCII text is a simple type of Unicode, stored as a sequence of byte values that represent characters:

C:\misc> c:\python30\python

>>> ord('X')             # 'X' has integer code point 88 (the same in ASCII and Unicode)
88
>>> chr(88)              # 88 stands for character 'X'
'X'

>>> S = 'XYZ'            # A Unicode string of ASCII text
>>> S
'XYZ'
>>> len(S)               # 3 characters long
3
>>> [ord(c) for c in S]  # 3 characters with integer ordinal values
[88, 89, 90]
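
As a minimal sketch of the same idea in script form, the following round-trips a string through ord and chr and checks that every code point falls in the 7-bit ASCII range. The variable names (text, codes, rebuilt, is_ascii) are illustrative only and do not come from the session above:

```python
# Round-trip each character of an ASCII string through ord() and chr().
text = 'XYZ'
codes = [ord(c) for c in text]            # integer code point of each character
rebuilt = ''.join(chr(n) for n in codes)  # convert the integers back to a str
is_ascii = all(n < 128 for n in codes)    # True if every value fits in 7 bits

print(codes, rebuilt, is_ascii)
```

Because ord and chr are inverses of each other, rebuilt compares equal to the original string.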

Normal 7-bit ASCII text like this is represented with one byte per character under each of the Unicode encoding schemes described earlier in this chapter:

>>> S.encode('ascii')    # Values 0..127 in 1 byte (7 bits) each
b'XYZ'
>>> S.encode('latin-1')  # Values 0..255 in 1 byte (8 bits) each
b'XYZ'
>>> S.encode('utf-8')    # Values 0..127 in 1 byte, 128..2047 in 2, others 3 or 4
b'XYZ'
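
To check this claim rather than eyeball it, a short script along the following lines can encode the same text under all three schemes and compare the results. This is only a sketch, and the names encoded, same_bytes, and round_trips are made up for illustration:

```python
# Encode the same ASCII text under three encodings and compare the raw bytes.
text = 'XYZ'
encoded = {name: text.encode(name) for name in ('ascii', 'latin-1', 'utf-8')}

for name, raw in encoded.items():
    print(name, raw, len(raw))            # one byte per character in each case

# All three encodings produce the identical byte sequence for ASCII text.
same_bytes = len(set(encoded.values())) == 1

# Decoding with the matching encoding name recovers the original string.
round_trips = all(raw.decode(name) == text for name, raw in encoded.items())
print(same_bytes, round_trips)
```

The results are identical here only because every character in the text is in the 0..127 range.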

In fact, the bytes object returned by encoding ASCII text this way is really a sequence of small integers, which just happen to print as ASCII characters when possible:

>>> S.encode('latin-1')[0]
88
>>> list(S.encode('latin-1'))
[88, 89, 90]
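
The integer nature of bytes objects can be explored a little further with a sketch like the one below, which assumes Python 3.X as in the session above. The names raw, first, piece, rebuilt, and unprintable are hypothetical, chosen only for this illustration:

```python
# A bytes object is a sequence of small integers in the range 0..255.
raw = 'XYZ'.encode('latin-1')

first = raw[0]                  # indexing returns an int: 88
piece = raw[0:1]                # slicing returns another bytes object: b'X'
rebuilt = bytes([88, 89, 90])   # a bytes object built directly from integers

# Values with no printable ASCII character display as \xNN hex escapes.
unprintable = bytes([88, 0, 255])

print(first, piece, rebuilt, unprintable)
```

Only the byte value 88 in the last object prints as a character; the other two values have no printable ASCII form, so they appear as hex escapes.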