Python’s String Types
At a more concrete level, the Python language provides string data types to represent character text in your scripts. The string types you will use in your scripts depend upon the version of Python you’re using. Python 2.X has a general string type for representing binary data and simple 8-bit text like ASCII, along with a specific type for representing multibyte Unicode text:
- str for representing 8-bit text and binary data
- unicode for representing wide-character Unicode text
Python 2.X’s two string types are different (unicode allows for the extra size of characters and has extra support for encoding and decoding), but their operation sets largely overlap. The str string type in 2.X is used for text that can be represented with 8-bit bytes, as well as binary data that represents absolute byte values.
By contrast, Python 3.X comes with three string object types—one for textual data and two for binary data:
- str for representing Unicode text (both 8-bit and wider)
- bytes for representing binary data
- bytearray, a mutable flavor of the bytes type
As mentioned earlier, bytearray is also available in Python 2.6, but it’s simply a back-port from 3.0 with less content-specific behavior and is generally considered a 3.0 type.
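As a quick illustration of the three 3.X types listed above, the following sketch creates one object of each kind and compares them. It assumes Python 3.X, and the variable names are arbitrary choices made for this example.

```python
text = "spam"               # str: Unicode text
data = b"spam"              # bytes: immutable binary data
buf = bytearray(b"spam")    # bytearray: mutable variant of bytes

type_names = [type(obj).__name__ for obj in (text, data, buf)]
print(type_names)           # ['str', 'bytes', 'bytearray']

# Text and binary objects never compare equal, even with the same ASCII
# content, while the two binary types compare by their byte values
text_equals_data = (text == data)
data_equals_buf = (data == buf)
print(text_equals_data, data_equals_buf)   # False True
```

The literal syntax alone picks the type here: a plain quoted literal makes a str, and a b prefix makes a bytes.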
All three string types in 3.0 support similar operation sets, but they have different roles. The main goal behind this change in 3.X was to merge the normal and Unicode string types of 2.X into a single string type that supports both normal and Unicode text: developers wanted to remove the 2.X string dichotomy and make Unicode processing more natural. Given that ASCII and other 8-bit text is really a simple kind of Unicode, this convergence seems logically sound.
To achieve this, the 3.0 str type is defined as an immutable sequence of characters (not necessarily bytes). Its content may be normal text such as ASCII, which needs just one byte per character when encoded, or richer character set text that may require multiple bytes per character in an encoding such as UTF-8. When text of this type is transferred to and from files, it is encoded per the platform default unless you say otherwise, and explicit encoding names may be provided to translate str objects to and from different schemes, both in memory and when transferring to and from files.
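To make the role of explicit encoding names concrete, here is a minimal sketch of in-memory translation between str and encoded bytes. It assumes Python 3.X; the sample text and the two encodings are picked only for illustration.

```python
text = "sp\u00e4m"                      # four characters, one of them non-ASCII

utf8_bytes = text.encode("utf-8")       # the non-ASCII character takes two bytes
latin1_bytes = text.encode("latin-1")   # the same character takes one byte

lengths = (len(text), len(utf8_bytes), len(latin1_bytes))
print(lengths)                          # (4, 5, 4)

# Decoding with the matching encoding name restores the original text
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_latin1 = latin1_bytes.decode("latin-1")
print(round_trip_utf8 == text, round_trip_latin1 == text)   # True True
```

The character count of the str stays the same whichever encoding is used; only the encoded byte form varies.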
While 3.0’s new str type does achieve the desired string/unicode merging, many programs still need to process raw binary data that is not encoded per any text format. Image and audio files, as well as packed data used to interface with devices or C programs you might process with Python’s struct module, fall into this category. To support processing of truly binary data, therefore, a new type, bytes, also was introduced.
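The packed data mentioned above can be sketched with the struct module, which produces and consumes bytes objects in 3.X. The format string and values below are assumptions chosen for the example, not a layout taken from any particular device or C program.

```python
import struct

# Pack a 4-byte int and a 2-byte unsigned short in big-endian order
packed = struct.pack(">iH", 7, 2)
print(type(packed).__name__)   # bytes
print(packed)                  # b'\x00\x00\x00\x07\x00\x02'

# Unpacking with the same format string recovers the original values
values = struct.unpack(">iH", packed)
print(values)                  # (7, 2)
```

Data like this has no text encoding at all, which is why it belongs in bytes rather than str.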
In 2.X, the general str type filled this binary data role, because strings were just sequences of bytes (the separate unicode type handled wide-character strings). In 3.0, the bytes type is defined as an immutable sequence of 8-bit integers representing absolute byte values. Moreover, the 3.0 bytes type supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not string formatting (the % formatting operator was later added for bytes and bytearray in Python 3.5, while the format method remains specific to str).
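The overlap in operation sets can be seen in a short sketch like the following, which assumes Python 3.X and uses an arbitrary sample value. Note that the arguments and patterns passed to bytes operations must themselves be bytes.

```python
import re

data = b"spam and eggs"

upper = data.upper()                  # string method
words = data.split()                  # string method returning bytes items
joined = data + b"!"                  # sequence concatenation
found = re.findall(rb"\w+", data)     # bytes pattern applied to a bytes subject

print(upper)    # b'SPAM AND EGGS'
print(words)    # [b'spam', b'and', b'eggs']
print(joined)   # b'spam and eggs!'
print(found)    # [b'spam', b'and', b'eggs']

# Mixing text and binary operands is an error, not an implicit conversion
try:
    data + "!"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)                       # False

# Unlike str, bytes has no format method
has_format = hasattr(data, "format")
print(has_format)                     # False
```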
A 3.0 bytes object really is a sequence of small integers, each of which is in the range 0 through 255; indexing a bytes returns an int, slicing one returns another bytes, and running the list built-in on one returns a list of integers, not characters. When processed with operations that assume characters, though, the contents of bytes objects are assumed to be ASCII-encoded bytes (e.g., the isalpha method assumes each byte is an ASCII character code). Further, bytes objects are printed as character strings instead of integers for convenience.
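The integer nature of bytes described above is easy to verify interactively. The following sketch assumes Python 3.X and uses an arbitrary ASCII sample.

```python
data = b"spam"

first = data[0]            # indexing yields an int
part = data[1:3]           # slicing yields another bytes
codes = list(data)         # a list of small integers, not characters
alpha = data.isalpha()     # bytes are treated as ASCII character codes

print(first)    # 115
print(part)     # b'pa'
print(codes)    # [115, 112, 97, 109]
print(alpha)    # True
print(data)     # b'spam'
```

The last line shows the display convenience: the object holds the integers 115, 112, 97, and 109, but prints as the ASCII characters they stand for.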
While they were at it, Python developers also added a bytearray type in 3.0. bytearray is a variant of bytes that is mutable and so supports in-place changes. It supports the usual string operations that str and bytes do, as well as many of the same in-place change operations as lists (e.g., the append and extend methods, and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray finally adds direct in-place mutability for string data—something not possible without conversion to a mutable type in Python 2, and not supported by Python 3.0’s str or bytes.
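A minimal sketch of the in-place changes just described follows. It assumes Python 3.X, and the sample values are arbitrary.

```python
buf = bytearray(b"spam")

buf[0] = ord("S")          # index assignment takes an int byte value
buf.append(ord("!"))       # append a single byte value
buf.extend(b"??")          # extend with a sequence of bytes
print(buf)                 # bytearray(b'Spam!??')

# The immutable bytes type rejects the same in-place change
frozen = b"spam"
try:
    frozen[0] = ord("S")
    bytes_mutable = True
except TypeError:
    bytes_mutable = False
print(bytes_mutable)       # False
```

Because a bytearray holds integers, index assignment and append expect an int in the range 0 through 255 rather than a one-character string.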
Although Python 2.6 and 3.0 offer much the same functionality, they package it differently. In fact, the mapping from 2.6 to 3.0 string types is not direct—2.6’s str equates to both str and bytes in 3.0, and 3.0’s str equates to both str and unicode in 2.6. Moreover, the mutability of 3.0’s bytearray is unique.
In practice, though, this asymmetry is not as daunting as it might sound. It boils down to the following: in 2.6, you will use str for simple text and binary data and unicode for more advanced forms of text; in 3.0, you’ll use str for any kind of text (simple and Unicode) and bytes or bytearray for binary data. The choice is often made for you by the tools you use, especially in the case of file processing tools, the topic of the next section.