预计阅读本页时间:-
Text and Binary Files
File I/O (input and output) has also been revamped in 3.0 to reflect the str/bytes distinction and automatically support encoding Unicode text. Python now makes a sharp platform-independent distinction between text files and binary files:
Text files
广告:个人专属 VPN,独立 IP,无限流量,多机房切换,还可以屏蔽广告和恶意软件,每月最低仅 5 美元
When a file is opened in text mode, reading its data automatically decodes its content (per a platform default or a provided encoding name) and returns it as a str; writing takes a str and automatically encodes it before transferring it to the file. Text-mode files also support universal end-of-line translation and additional encoding specification arguments. Depending on the encoding name, text files may also automatically process the byte order mark sequence at the start of a file (more on this momentarily).
Binary files
When a file is opened in binary mode by adding a b (lowercase only) to the mode string argument in the built-in open call, reading its data does not decode it in any way but simply returns its content raw and unchanged, as a bytes object; writing similarly takes a bytes object and transfers it to the file unchanged. Binary-mode files also accept a bytearray object for the content to be written to the file.
Because the language sharply differentiates between str and bytes, you must decide whether your data is text or binary in nature and use either str or bytes objects to represent its content in your script, as appropriate. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content:
- If you are processing image files, packed data created by other programs whose content you must extract, or some device data streams, chances are good that you will want to deal with it using bytes and binary-mode files. You might also opt for bytearray if you wish to update the data without making copies of it in memory.
- If instead you are processing something that is textual in nature, such as program output, HTML, internationalized text, or CSV or XML files, you’ll probably want to use str and text-mode files.
Notice that the mode string argument to built-in function open (its second argument) becomes fairly crucial in Python 3.0—its content not only specifies a file processing mode, but also implies a Python object type. By adding a b to the mode string, you specify binary mode and will receive, or must provide, a bytes object to represent the file’s content when reading or writing. Without the b, your file is processed in text mode, and you’ll use str objects to represent its content in your script. For example, the modes rb, wb, and rb+ imply bytes; r, w+, and rt (the default) imply str.
Text-mode files also handle the byte order marker (BOM) sequence that may appear at the start of files under certain encoding schemes. In the UTF-16 and UTF-32 encodings, for example, the BOM specifies big- or little-endian format (essentially, which end of a bitstring is most significant). A UTF-8 text file may also include a BOM to declare that it is UTF-8 in general, but this isn’t guaranteed. When reading and writing data using these encoding schemes, Python automatically skips or writes the BOM if it is implied by a general encoding name or if you provide a more specific encoding name to force the issue. For example, the BOM is always processed for “utf-16,” the more specific encoding name “utf-16-le” species little-endian UTF-16 format, and the more specific encoding name “utf-8-sig” forces Python to both skip and write a BOM on input and output, respectively, for UTF-8 text (the general name “utf-8” does not).
We’ll learn more about BOMs and files in general in the section Handling the BOM in 3.0. First, let’s explore the implications of Python’s new Unicode string model.