Other Unicode Coding Techniques

Some encodings use even larger byte sequences to represent characters. When needed, you can specify both 16- and 32-bit Unicode values for characters in your strings—use "\u..." with four hex digits for the former, and "\U...." with eight hex digits for the latter:

>>> S = 'A\u00c4B\U000000e8C'
>>> S                                # A, B, C, and 2 non-ASCII characters
'AÄBèC'
>>> len(S)                           # 5 characters long
5

>>> S.encode('latin-1')
b'A\xc4B\xe8C'
>>> len(S.encode('latin-1'))         # 5 bytes in latin-1
5

>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'
>>> len(S.encode('utf-8'))           # 7 bytes in utf-8
7
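The size differences above can also be tabulated in one pass. The following is a minimal sketch rather than part of the original session: the names `text` and `sizes` are illustrative, and the `utf-16` and `utf-32` codecs are added here only to show the larger byte sequences mentioned earlier (both prepend a byte order mark when encoding, which is included in the counts).

```python
# Compare how many bytes the same 5-character string needs per encoding.
text = 'A\u00c4B\U000000e8C'          # same value as S in the session above

sizes = {}
for name in ('latin-1', 'utf-8', 'utf-16', 'utf-32'):
    sizes[name] = len(text.encode(name))   # length of the encoded bytes object

for name, size in sizes.items():
    print(f'{name:8} {size:2} bytes for {len(text)} characters')
```

The character count stays at 5 throughout; only the encoded byte count changes with the encoding name.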


Interestingly, some other encodings may use very different byte formats. The cp500 EBCDIC encoding, for example, doesn’t even encode ASCII the same way as the encodings we’ve been using so far (since Python encodes and decodes for us, we only generally need to care about this when providing encoding names):

>>> S
'AÄBèC'
>>> S.encode('cp500')                # Two other Western European encodings
b'\xc1c\xc2T\xc3'
>>> S.encode('cp850')                # 5 bytes each
b'A\x8eB\x8aC'

>>> S = 'spam'                       # ASCII text is the same in most
>>> S.encode('latin-1')
b'spam'
>>> S.encode('utf-8')
b'spam'
>>> S.encode('cp500')                # But not in cp500: IBM EBCDIC!
b'\xa2\x97\x81\x94'
>>> S.encode('cp850')
b'spam'
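Because the byte formats differ this much, bytes should be decoded with the same encoding name that produced them. The sketch below illustrates the point under that assumption; `round_trip`, `sample`, `results`, and `mismatched` are hypothetical names introduced only for this example.

```python
def round_trip(text, encoding):
    """Encode text to bytes and decode it back with the same encoding name."""
    return text.encode(encoding).decode(encoding)

sample = 'A\xc4B\xe8C'

# Every encoding recovers the original text when the names match.
results = {name: round_trip(sample, name) == sample
           for name in ('latin-1', 'utf-8', 'cp500', 'cp850')}
print(results)

# Decoding EBCDIC bytes under a different name raises no error here,
# but the resulting text is not the original string.
mismatched = 'spam'.encode('cp500').decode('latin-1')
print(mismatched == 'spam')
```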

Technically speaking, you can also build Unicode strings piecemeal using chr instead of Unicode or hex escapes, but this might become tedious for large strings:

>>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
>>> S
'AÄBèC'
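When the code points are already available as integers, a join over chr is less tedious than repeated concatenation, and ord converts back the other way. This is only an illustrative sketch; `code_points`, `built`, and `recovered` are names made up for the example.

```python
# Build a string from integer code points, then recover the code points.
code_points = [0x41, 0xC4, 0x42, 0xE8, 0x43]

built = ''.join(chr(cp) for cp in code_points)   # chr: integer -> character
recovered = [ord(ch) for ch in built]            # ord: character -> integer

print(built)
print([hex(cp) for cp in recovered])
```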

Two cautions here. First, Python 3.0 allows special characters to be coded with both hex and Unicode escapes in str strings, but only with hex escapes in bytes strings. Unicode escape sequences are taken verbatim in bytes literals, not as escapes (silently in 3.0; later Python releases also issue a warning for such unrecognized escapes). In fact, bytes must be decoded to str strings to print their non-ASCII characters properly:

>>> S = 'A\xC4B\xE8C'                # str recognizes hex and Unicode escapes
>>> S
'AÄBèC'

>>> S = 'A\u00C4B\U000000E8C'
>>> S
'AÄBèC'

>>> B = b'A\xC4B\xE8C'               # bytes recognizes hex but not Unicode
>>> B
b'A\xc4B\xe8C'

>>> B = b'A\u00C4B\U000000E8C'       # Escape sequences taken literally!
>>> B
b'A\\u00C4B\\U000000E8C'

>>> B = b'A\xC4B\xE8C'               # Use hex escapes for bytes
>>> B                                # Prints non-ASCII as hex
b'A\xc4B\xe8C'
>>> print(B)
b'A\xc4B\xe8C'
>>> B.decode('latin-1')              # Decode as latin-1 to interpret as text
'AÄBèC'
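To see what a bytes literal really holds when Unicode escapes are taken literally, it helps to spell the backslashes out explicitly. The sketch below does that, and also uses Python's standard `unicode_escape` codec to interpret the stored escape text after the fact; the variable names are illustrative only.

```python
# The bytes that b'A\u00C4B\U000000E8C' stores, with backslashes doubled
# so that each one is an ordinary character: 19 bytes, not 5.
literal = b'A\\u00C4B\\U000000E8C'
print(len(literal))

# The unicode_escape codec interprets the escape text while decoding.
text = literal.decode('unicode_escape')
print(text)

# Alternative: build the intended 5 bytes directly from integer values.
from_ints = bytes([0x41, 0xC4, 0x42, 0xE8, 0x43])
print(from_ints, from_ints.decode('latin-1') == text)
```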

Second, bytes literals require characters to be either ASCII characters or, if their values are greater than 127, escaped; str strings, on the other hand, allow literals containing any character in the source character set (which, as discussed later, defaults to UTF-8 unless an encoding declaration is given in the source file):

>>> S = 'AÄBèC'                      # Chars from UTF-8 if no encoding declaration
>>> S
'AÄBèC'

>>> B = b'AÄBèC'
SyntaxError: bytes can only contain ASCII literal characters.

>>> B = b'A\xC4B\xE8C'               # Chars must be ASCII, or escapes
>>> B
b'A\xc4B\xe8C'
>>> B.decode('latin-1')
'AÄBèC'

>>> S.encode()                       # encode() defaults to UTF-8 unless a name is passed
b'A\xc3\x84B\xc3\xa8C'
>>> S.encode('utf-8')
b'A\xc3\x84B\xc3\xa8C'

>>> B.decode()                       # Raw bytes do not correspond to utf-8
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...
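In a program, the decoding failure above shows up as an exception that can be caught. The following is a minimal sketch of one possible way to handle it, not a recommended policy: `decode_with_fallback` is a hypothetical helper, and falling back to latin-1 is an assumption made for this example. Because latin-1 maps every byte value to some character, the fallback never fails, but it yields the right text only if the bytes really were latin-1.

```python
def decode_with_fallback(data, primary='utf-8', fallback='latin-1'):
    """Try the primary encoding first; on failure, decode with the fallback."""
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # Assumption: the data was written in the fallback encoding.
        return data.decode(fallback)

latin_bytes = b'A\xc4B\xe8C'             # latin-1 bytes, invalid as UTF-8
utf8_bytes = b'A\xc3\x84B\xc3\xa8C'      # the same text encoded as UTF-8

print(decode_with_fallback(latin_bytes))
print(decode_with_fallback(utf8_bytes))
```

Knowing the encoding in advance and passing its name explicitly remains the more reliable approach.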