Chapter 36. Unicode and Byte Strings

In Chapter 7, the strings chapter in the core types part of this book, I deliberately limited the scope to the subset of string topics that most Python programmers need to know about. Because the vast majority of programmers deal with simple forms of text like ASCII, they can happily work with Python’s basic str string type and its associated operations, and don’t need to come to grips with more advanced string concepts. In fact, such programmers can largely ignore the string changes in Python 3.0 and continue to use strings as they have in the past.

On the other hand, some programmers deal with more specialized types of data: non-ASCII character sets, image file contents, and so on. For those programmers (and others who may join them some day), in this chapter we’re going to fill in the rest of the Python string story and look at some more advanced concepts in Python’s string model.


Specifically, we’ll explore the basics of Python’s support for Unicode text (wide-character strings used in internationalized applications) as well as binary data (strings that represent absolute byte values). As we’ll see, the advanced string representation story has diverged in recent versions of Python:

  • Python 3.0 provides an alternative string type for binary data and supports Unicode text in its normal string type (ASCII is treated as a simple type of Unicode).
  • Python 2.6 provides an alternative string type for non-ASCII Unicode text and supports both simple text and binary data in its normal string type.
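
The 3.0 side of this split can be sketched in a few lines. This is only a minimal illustration, assuming a Python 3.X interpreter; the variable names and the choice of UTF-8 are assumptions made for the example, not part of the discussion above:

```python
# Assumes Python 3.X; names and the UTF-8 encoding are illustrative choices.
text = 'sp\xe4m'                 # str: a sequence of Unicode characters
data = text.encode('utf-8')      # bytes: the same text as raw byte values

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5 (the non-ASCII character takes 2 bytes)

# ASCII text is a simple kind of Unicode: one byte per character when encoded
ascii_bytes = 'spam'.encode('ascii')
print(ascii_bytes == b'spam')    # True

# Decoding the bytes recovers the original text
roundtrip = data.decode('utf-8')
print(roundtrip == text)         # True
```

In 2.6 the roles are reversed: the normal str type serves for both simple text and binary data, and a separate type is used for non-ASCII Unicode text.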

In addition, because Python’s string model has a direct impact on how you process non-ASCII files, we’ll explore the fundamentals of that related topic here as well. Finally, we’ll take a brief look at some advanced string and binary tools, such as pattern matching, object pickling, binary data packing, and XML parsing, and the ways in which they are impacted by 3.0’s string changes.
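
The link between the string model and files can be previewed with a short sketch, again assuming Python 3.X. The scratch file name, the UTF-8 encoding, and the variable names here are hypothetical, chosen only for illustration:

```python
import os
import tempfile

text = 'sp\xe4m'   # non-ASCII text held in a normal 3.X str

# Hypothetical scratch file inside a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'unidata.txt')

    # Text mode: str is encoded on write and decoded on read
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()

    # Binary mode: the same file content comes back as undecoded bytes
    with open(path, 'rb') as f:
        as_bytes = f.read()

print(as_text == text)   # True
print(as_bytes)          # b'sp\xc3\xa4m'
```

The same file therefore yields a str in text mode and a bytes object in binary mode, which is why the choice of string type and the choice of file mode go together.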

This is officially an advanced topics chapter, because not all programmers will need to delve into the worlds of Unicode encodings or binary data. If you ever need to care about processing either of these, though, you’ll find that Python’s string models provide the support you need.