2.5.2. ASSIGNING MULTIPLE VALUES AT ONCE
Here’s a cool programming shortcut: in Python, you can use a tuple to assign multiple values at once.
>>> v = ('a', 2, True)
>>> (x, y, z) = v
①
>>> x
'a'
>>> y
2
>>> z
True
1. v is a tuple of three elements, and (x, y, z) is a tuple of three variables. Assigning one to the other
assigns each of the values of v to each of the variables, in order.
This has all kinds of uses. Suppose you want to assign names to a range of values. You can use the built-in
range() function with multi-variable assignment to quickly assign consecutive values.
>>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)
①
>>> MONDAY
②
0
>>> TUESDAY
1
>>> SUNDAY
6
1. The built-in range() function constructs a sequence of integers. (Technically, the range() function returns
a lazy range object, not a list or a tuple, but you’ll learn about that distinction later.) MONDAY, TUESDAY,
WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you’re defining. (This example came from the
calendar module, a fun little module that prints calendars, like the UNIX program cal. The calendar
module defines integer constants for days of the week.)
2. Now each variable has its value: MONDAY is 0, TUESDAY is 1, and so forth.
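The two examples above can be collected into a short script. This is only a sketch: the names first, second, p, q, and mismatch_message are mine, not part of the original example, and the swap and the length-mismatch case are extra illustrations of the same mechanism.

```python
# Unpack a three-element tuple into three names, in order.
v = ('a', 2, True)
(x, y, z) = v

# The parentheses on the left are optional, and any iterable of the right length works.
MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY = range(7)

# Swapping two names uses the same mechanism: the right side is packed
# into a tuple first, then unpacked into the names on the left.
first, second = 1, 2
first, second = second, first

# A length mismatch raises ValueError instead of silently dropping values.
try:
    (p, q) = v
except ValueError as error:
    mismatch_message = str(error)

print(x, y, z)
print(MONDAY, SUNDAY)
print(first, second)
print(mismatch_message)
```

The number of names on the left has to match the number of values on the right exactly.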
You can also use multi-variable assignment to build functions that return multiple values, simply by returning
a tuple of all the values. The caller can treat it as a single tuple, or it can assign the values to individual
variables. Many standard Python libraries do this, including the os module, which you'll learn about in the
⁂
2.6. SETS
A set is an unordered “bag” of unique values. A single set can contain values of any immutable datatype.
Once you have two sets, you can do standard set operations like union, intersection, and set difference.
2.6.1. CREATING A SET
First things first. Creating a set is easy.
>>> a_set = {1}
①
>>> a_set
{1}
>>> type(a_set)
②
<class 'set'>
>>> a_set = {1, 2}
③
>>> a_set
{1, 2}
1. To create a set with one value, put the value in curly brackets ({}).
2. Sets are actually implemented as classes, but don’t worry about that for now.
3. To create a set with multiple values, separate the values with commas and wrap it all up with curly brackets.
You can also create a set out of a list.
>>> a_list = ['a', 'b', 'mpilgrim', True, False, 42]
>>> a_set = set(a_list)
①
>>> a_set
②
{'a', False, 'b', True, 'mpilgrim', 42}
>>> a_list
③
['a', 'b', 'mpilgrim', True, False, 42]
1. To create a set from a list, use the set() function. (Pedants who know about how sets are implemented
will point out that this is not really calling a function, but instantiating a class. I promise you will learn the
difference later in this book. For now, just know that set() acts like a function, and it returns a set.)
2. As I mentioned earlier, a single set can contain values of any immutable datatype. And, as I mentioned earlier, sets are
unordered. This set does not remember the original order of the list that was used to create it. If you were
to add items to this set, it would not remember the order in which you added them.
3. The original list is unchanged.
Don’t have any values yet? Not a problem. You can create an empty set.
>>> a_set = set()
①
>>> a_set
②
set()
>>> type(a_set)
③
<class 'set'>
>>> len(a_set)
④
0
>>> not_sure = {}
⑤
>>> type(not_sure)
<class 'dict'>
1. To create an empty set, call set() with no arguments.
2. The printed representation of an empty set looks a bit strange. Were you expecting {}, perhaps? That
would denote an empty dictionary, not an empty set. You’ll learn about dictionaries later in this chapter.
3. Despite the strange printed representation, this is a set…
4. …and this set has no members.
5. Due to historical quirks carried over from Python 2, you can not create an empty set with two curly
brackets. This actually creates an empty dictionary, not an empty set.
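Here is a small sketch that puts the three ways of creating a set side by side, along with the empty-braces trap. The variable names are illustrative only.

```python
# Three ways to end up with a set.
literal_set = {1, 2}
from_list = set(['a', 'b', 'a', 'b', 'c'])   # duplicates in the list collapse
empty_set = set()

# Empty braces create a dictionary, not a set.
not_a_set = {}

print(type(literal_set).__name__)
print(sorted(from_list))
print(len(empty_set))
print(type(not_a_set).__name__)
```

Sorting before printing is deliberate: a set has no order of its own, so sorted() is the simplest way to get repeatable output.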
2.6.2. MODIFYING A SET
There are two different ways to add values to an existing set: the add() method, and the update() method.
>>> a_set = {1, 2}
>>> a_set.add(4)
①
>>> a_set
{1, 2, 4}
>>> len(a_set)
②
3
>>> a_set.add(1)
③
>>> a_set
{1, 2, 4}
>>> len(a_set)
④
3
1. The add() method takes a single argument, which can be any immutable datatype, and adds the given value to the set.
2. This set now has 3 members.
3. Sets are bags of unique values. If you try to add a value that already exists in the set, it will do nothing. It
won’t raise an error; it’s just a no-op.
4. This set still has 3 members.
>>> a_set = {1, 2, 3}
>>> a_set
{1, 2, 3}
>>> a_set.update({2, 4, 6})
①
>>> a_set
②
{1, 2, 3, 4, 6}
>>> a_set.update({3, 6, 9}, {1, 2, 3, 5, 8, 13})
③
>>> a_set
{1, 2, 3, 4, 5, 6, 8, 9, 13}
>>> a_set.update([10, 20, 30])
④
>>> a_set
{1, 2, 3, 4, 5, 6, 8, 9, 10, 13, 20, 30}
1. The update() method takes one argument, a set, and adds all its members to the original set. It’s as if you
called the add() method with each member of the set.
2. Duplicate values are ignored, since sets can not contain duplicates.
3. You can actually call the update() method with any number of arguments. When called with two sets, the
update() method adds all the members of each set to the original set (dropping duplicates).
4. The update() method can take objects of a number of different datatypes, including lists. When called with
a list, the update() method adds all the items of the list to the original set.
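To see the difference between add() and update() in one place, here is a minimal sketch. The pairs set and the rejected variable are my own additions; they illustrate that add() treats its argument as a single value, which therefore has to be hashable.

```python
a_set = {1, 2}

# add() inserts exactly one value; adding it a second time changes nothing.
a_set.add(4)
a_set.add(4)

# update() accepts any number of iterables and adds each of their items.
a_set.update({2, 4, 6}, [10, 20], (30,))

# add() treats its argument as ONE value, so a tuple goes in whole...
pairs = set()
pairs.add((1, 2))

# ...while a list is rejected, because lists are mutable and therefore unhashable.
try:
    pairs.add([3, 4])
except TypeError as error:
    rejected = str(error)

print(sorted(a_set))
print(pairs)
print(rejected)
```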
2.6.3. REMOVING ITEMS FROM A SET
There are three ways to remove individual values from a set. The first two, discard() and remove(), have
one subtle difference.
>>> a_set = {1, 3, 6, 10, 15, 21, 28, 36, 45}
>>> a_set
{1, 3, 36, 6, 10, 45, 15, 21, 28}
>>> a_set.discard(10)
①
>>> a_set
{1, 3, 36, 6, 45, 15, 21, 28}
>>> a_set.discard(10)
②
>>> a_set
{1, 3, 36, 6, 45, 15, 21, 28}
>>> a_set.remove(21)
③
>>> a_set
{1, 3, 36, 6, 45, 15, 28}
>>> a_set.remove(21)
④
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 21
1. The discard() method takes a single value as an argument and removes that value from the set.
2. If you call the discard() method with a value that doesn’t exist in the set, it does nothing. No error; it’s
just a no-op.
3. The remove() method also takes a single value as an argument, and it also removes that value from the set.
4. Here’s the difference: if the value doesn’t exist in the set, the remove() method raises a KeyError
exception.
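In a script, the choice between the two usually comes down to whether a missing value is an error. A minimal sketch, with second_remove_failed as an illustrative name:

```python
a_set = {1, 3, 6, 10}

a_set.discard(10)      # removes 10
a_set.discard(10)      # 10 is already gone; silently does nothing

a_set.remove(6)        # removes 6
try:
    a_set.remove(6)    # 6 is already gone; raises KeyError
    second_remove_failed = False
except KeyError:
    second_remove_failed = True

print(sorted(a_set))
print(second_remove_failed)
```

Use discard() when "make sure it is gone" is all you mean, and remove() when the absence of the value would indicate a bug you want to hear about.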
Like lists, sets have a pop() method.
>>> a_set = {1, 3, 6, 10, 15, 21, 28, 36, 45}
>>> a_set.pop()
①
1
>>> a_set.pop()
3
>>> a_set.pop()
36
>>> a_set
{6, 10, 45, 15, 21, 28}
>>> a_set.clear()
②
>>> a_set
set()
>>> a_set.pop()
③
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'pop from an empty set'
1. The pop() method removes a single value from a set and returns the value. However, since sets are
unordered, there is no “last” value in a set, so there is no way to control which value gets removed. It is
completely arbitrary.
2. The clear() method removes all values from a set, leaving you with an empty set. The result looks the same as
a_set = set(), but clear() empties the existing set in place, whereas a_set = set() creates a new empty set
and rebinds the a_set variable to it.
3. Attempting to pop a value from an empty set will raise a KeyError exception.
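The difference between clearing a set and assigning a fresh one only shows up when a second name refers to the same set. The sketch below assumes such a second name (alias, my invention) and also shows one way to avoid the KeyError from popping an empty set.

```python
a_set = {1, 3, 6}
alias = a_set                     # second name bound to the same set object

a_set.clear()                     # empties the shared object in place
alias_after_clear = set(alias)    # snapshot: the alias sees the change

a_set = {1, 3, 6}
alias = a_set
a_set = set()                     # rebinds a_set only; the old object is untouched
alias_after_rebind = set(alias)

# pop() on an empty set raises KeyError, so guard with a truth test first.
popped = a_set.pop() if a_set else None

print(alias_after_clear)
print(sorted(alias_after_rebind))
print(popped)
```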
2.6.4. COMMON SET OPERATIONS
Python’s set type supports several common set operations.
>>> a_set = {2, 4, 5, 9, 12, 21, 30, 51, 76, 127, 195}
>>> 30 in a_set
①
True
>>> 31 in a_set
False
>>> b_set = {1, 2, 3, 5, 6, 8, 9, 12, 15, 17, 18, 21}
>>> a_set.union(b_set)
②
{1, 2, 195, 4, 5, 6, 8, 12, 76, 15, 17, 18, 3, 21, 30, 51, 9, 127}
>>> a_set.intersection(b_set)
③
{9, 2, 12, 5, 21}
>>> a_set.difference(b_set)
④
{195, 4, 76, 51, 30, 127}
>>> a_set.symmetric_difference(b_set)
⑤
{1, 3, 4, 6, 8, 76, 15, 17, 18, 195, 127, 30, 51}
1. To test whether a value is a member of a set, use the in operator. This works the same as lists.
2. The union() method returns a new set containing all the elements that are in either set.
3. The intersection() method returns a new set containing all the elements that are in both sets.
4. The difference() method returns a new set containing all the elements that are in a_set but not b_set.
5. The symmetric_difference() method returns a new set containing all the elements that are in exactly one
of the sets.
Three of these methods are symmetric.
# continued from the previous example
>>> b_set.symmetric_difference(a_set)
①
{3, 1, 195, 4, 6, 8, 76, 15, 17, 18, 51, 30, 127}
>>> b_set.symmetric_difference(a_set) == a_set.symmetric_difference(b_set)
②
True
>>> b_set.union(a_set) == a_set.union(b_set)
③
True
>>> b_set.intersection(a_set) == a_set.intersection(b_set)
④
True
>>> b_set.difference(a_set) == a_set.difference(b_set)
⑤
False
1. The symmetric difference of a_set from b_set looks different than the symmetric difference of b_set from
a_set, but remember, sets are unordered. Any two sets that contain all the same values (with none left
over) are considered equal.
2. And that’s exactly what happens here. Don’t be fooled by the Python Shell’s printed representation of these
sets. They contain the same values, so they are equal.
3. The union of two sets is also symmetric.
4. The intersection of two sets is also symmetric.
5. The difference of two sets is not symmetric. That makes sense; it’s analogous to subtracting one number
from another. The order of the operands matters.
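Each of these four methods also has an operator spelling when both operands are sets. The sketch below checks that the two spellings agree on the sets from the example; the *_matches names are illustrative.

```python
a_set = {2, 4, 5, 9, 12, 21, 30, 51, 76, 127, 195}
b_set = {1, 2, 3, 5, 6, 8, 9, 12, 15, 17, 18, 21}

# Operator form on the left, method form on the right.
union_matches = (a_set | b_set) == a_set.union(b_set)
intersection_matches = (a_set & b_set) == a_set.intersection(b_set)
difference_matches = (a_set - b_set) == a_set.difference(b_set)
symmetric_matches = (a_set ^ b_set) == a_set.symmetric_difference(b_set)

# Difference depends on operand order; the other three do not.
difference_is_symmetric = (a_set - b_set) == (b_set - a_set)

print(union_matches, intersection_matches, difference_matches, symmetric_matches)
print(difference_is_symmetric)
```

One difference worth knowing: the methods accept any iterable as their argument, while the operators insist on sets on both sides.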
Finally, there are a few questions you can ask of sets.
>>> a_set = {1, 2, 3}
>>> b_set = {1, 2, 3, 4}
>>> a_set.issubset(b_set)
①
True
>>> b_set.issuperset(a_set)
②
True
>>> a_set.add(5)
③
>>> a_set.issubset(b_set)
False
>>> b_set.issuperset(a_set)
False
1. a_set is a subset of b_set — all the members of a_set are also members of b_set.
2. Asking the same question in reverse, b_set is a superset of a_set, because all the members of a_set are
also members of b_set.
3. As soon as you add a value to a_set that is not in b_set, both tests return False.
2.6.5. SETS IN A BOOLEAN CONTEXT
You can use sets in a boolean context, such as an if statement.
>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...
>>> is_it_true(set())
①
no, it's false
>>> is_it_true({'a'})
②
yes, it's true
>>> is_it_true({False})
③
yes, it's true
1. In a boolean context, an empty set is false.
2. Any set with at least one item is true.
3. Any set with at least one item is true. The value of the items is irrelevant.
⁂
2.7. DICTIONARIES
A dictionary is an unordered set of key-value pairs. When you add a key to a dictionary, you must also add
a value for that key. (You can always change the value later.) Python dictionaries are optimized for retrieving
the value when you know the key, but not the other way around.
☞ A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes
always start with a % character. In Python, variables can be named anything, and
Python keeps track of the datatype internally.
2.7.1. CREATING A DICTIONARY
Creating a dictionary is easy. The syntax is similar to sets, but instead of values, you have key-value pairs.
Once you have a dictionary, you can look up values by their key.
>>> a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}
①
>>> a_dict
{'server': 'db.diveintopython3.org', 'database': 'mysql'}
>>> a_dict['server']
②
'db.diveintopython3.org'
>>> a_dict['database']
③
'mysql'
>>> a_dict['db.diveintopython3.org']
④
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'db.diveintopython3.org'
1. First, you create a new dictionary with two items and assign it to the variable a_dict. Each item is a key-
value pair, and the whole set of items is enclosed in curly braces.
2. 'server' is a key, and its associated value, referenced by a_dict['server'], is
'db.diveintopython3.org'.
3. 'database' is a key, and its associated value, referenced by a_dict['database'], is 'mysql'.
4. You can get values by key, but you can’t get keys by value. So a_dict['server'] is
'db.diveintopython3.org', but a_dict['db.diveintopython3.org'] raises an exception, because
'db.diveintopython3.org' is not a key.
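If a KeyError is not what you want for a missing key, there are gentler ways to ask. This sketch uses the in operator and the dictionary's get() method; the 'port' key and the 3306 default are made up for illustration.

```python
a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

# Test for the key first...
has_server = 'server' in a_dict
has_value_as_key = 'db.diveintopython3.org' in a_dict   # values are not keys

# ...or use get(), which returns a default (None unless you supply one) instead of raising.
missing = a_dict.get('port')
fallback = a_dict.get('port', 3306)   # 3306 is an arbitrary illustrative default

# Going from a value back to its key means scanning every item.
keys_for_mysql = [key for key, value in a_dict.items() if value == 'mysql']

print(has_server, has_value_as_key)
print(missing, fallback)
print(keys_for_mysql)
```

The last line is the "other way around" that dictionaries are not optimized for: it works, but it has to look at every pair.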
2.7.2. MODIFYING A DICTIONARY
Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any
time, or you can modify the value of an existing key. Continuing from the previous example:
>>> a_dict
{'server': 'db.diveintopython3.org', 'database': 'mysql'}
>>> a_dict['database'] = 'blog'
①
>>> a_dict
{'server': 'db.diveintopython3.org', 'database': 'blog'}
>>> a_dict['user'] = 'mark'
②
>>> a_dict
③
{'server': 'db.diveintopython3.org', 'user': 'mark', 'database': 'blog'}
>>> a_dict['user'] = 'dora'
④
>>> a_dict
{'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}
>>> a_dict['User'] = 'mark'
⑤
>>> a_dict
{'User': 'mark', 'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}
1. You can not have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old
value.
2. You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
3. The new dictionary item (key 'user', value 'mark') appears to be in the middle. In fact, it was just a
coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that
they appear to be out of order now.
4. Assigning a value to an existing dictionary key simply replaces the old value with the new one.
5. Will this change the value of the user key back to "mark"? No! Look at the key closely — that’s a capital U
in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not
overwriting an existing one. It may look similar to you, but as far as Python is concerned, it’s completely
different.
2.7.3. MIXED-VALUE DICTIONARIES
Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans,
arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be
the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be
strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
In fact, you’ve already seen a dictionary with non-string keys and values, in your first Python program.
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
Let's tear that apart in the interactive shell.
>>> SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
...             1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
>>> len(SUFFIXES)
①
2
>>> 1000 in SUFFIXES
②
True
>>> SUFFIXES[1000]
③
['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
>>> SUFFIXES[1024]
④
['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']
>>> SUFFIXES[1000][3]
⑤
'TB'
1. Like lists and sets, the len() function gives you the number of keys in a dictionary.
2. And like lists and sets, you can use the in operator to test whether a specific key is defined in a dictionary.
3. 1000 is a key in the SUFFIXES dictionary; its value is a list of eight items (eight strings, to be precise).
4. Similarly, 1024 is a key in the SUFFIXES dictionary; its value is also a list of eight items.
5. Since SUFFIXES[1000] is a list, you can address individual items in the list by their 0-based index.
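The two-step lookup (key first, then list index) can be wrapped in a function. The helper below, pick_suffix(), is hypothetical and is not part of the program this dictionary came from; it only shows the membership test and the chained indexing together.

```python
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def pick_suffix(multiple, index):
    """Hypothetical helper: return the suffix at a 0-based index for a given multiple."""
    if multiple not in SUFFIXES:          # the in operator tests keys, not values
        raise ValueError('multiple must be one of {}'.format(sorted(SUFFIXES)))
    return SUFFIXES[multiple][index]      # dictionary lookup, then list index

print(pick_suffix(1000, 3))
print(pick_suffix(1024, 0))
```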
2.7.4. DICTIONARIES IN A BOOLEAN CONTEXT
You can also use a dictionary in a boolean context, such as an if statement.
>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...
>>> is_it_true({})
①
no, it's false
>>> is_it_true({'a': 1})
②
yes, it's true
1. In a boolean context, an empty dictionary is false.
2. Any dictionary with at least one key-value pair is true.
⁂
2.8. None
None is a special constant in Python. It is a null value. None is not the same as False. None is not 0. None is
not an empty string. Comparing None to anything other than None will always return False.
None is the only null value. It has its own datatype (NoneType). You can assign None to any variable, but you
can not create other NoneType objects. All variables whose value is None are equal to each other.
>>> type(None)
<class 'NoneType'>
>>> None == False
False
>>> None == 0
False
>>> None == ''
False
>>> None == None
True
>>> x = None
>>> x == None
True
>>> y = None
>>> x == y
True
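Because there is only ever one None object, identity tests work as well as equality tests, and testing with is None is the form Python's style guide recommends. The describe() helper below is hypothetical; it is only a sketch of how to tell None apart from other false values.

```python
x = None
y = None

equal_by_value = (x == y)
same_object = (x is y)      # None is a singleton, so identity also holds

def describe(value):
    """Hypothetical helper: distinguish None from other false values."""
    if value is None:
        return 'missing'
    if not value:           # 0, '', empty containers, and False land here
        return 'empty or zero'
    return 'present'

print(equal_by_value, same_object)
print(describe(None), describe(0), describe(''), describe('a'))
```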
2.8.1. None IN A BOOLEAN CONTEXT
In a boolean context, None is false and not None is true.
>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...
>>> is_it_true(None)
no, it's false
>>> is_it_true(not None)
yes, it's true
⁂
2.9. FURTHER READING
• PEP 237: Unifying Long Integers and Integers
• PEP 238: Changing the Division Operator
CHAPTER 3. COMPREHENSIONS
❝ Our imagination is stretched to the utmost, not, as in fiction, to imagine things which are not really there, but just
to comprehend those things which are. ❞
3.1. DIVING IN
Every programming language has that one feature, a complicated thing intentionally made simple. If
you’re coming from another language, you could easily miss it, because your old language didn’t make that
thing simple (because it was busy making something else simple instead). This chapter will teach you about
list comprehensions, dictionary comprehensions, and set comprehensions: three related concepts centered
around one very powerful technique. But first, I want to take a little detour into two modules that will help
you navigate your local file system.
⁂
3.2. WORKING WITH FILES AND DIRECTORIES
Python 3 comes with a module called os, which stands for “operating system.” The os module contains a plethora of functions to get information on — and in some cases, to manipulate — local directories, files,
processes, and environment variables. Python does its best to offer a unified API across all supported
operating systems so your programs can run on any computer with as little platform-specific code as
possible.
3.2.1. THE CURRENT WORKING DIRECTORY
When you’re just getting started with Python, you’re going to spend a lot of time in the Python Shell.
Throughout this book, you will see examples that go like this:
1. Import one of the modules in the examples folder
2. Call a function in that module
3. Explain the result
If you don’t know about the current working directory, step 1 will probably fail with an ImportError. Why?
Because Python will look for the example module in the import search path, but it won’t find it because the
examples folder isn’t one of the directories in the search path. To get past this, you can do one of two things:
1. Add the examples folder to the import search path
2. Change the current working directory to the examples folder
The current working directory is an invisible property that Python holds in memory at all times. There is
always a current working directory, whether you’re in the Python Shell, running your own Python script from
the command line, or running a Python CGI script on a web server somewhere.
The os module contains two functions to deal with the current working directory.
>>> import os
①
>>> print(os.getcwd())
②
C:\Python31
>>> os.chdir('/Users/pilgrim/diveintopython3/examples')
③
>>> print(os.getcwd())
④
C:\Users\pilgrim\diveintopython3\examples
1. The os module comes with Python; you can import it anytime, anywhere.
2. Use the os.getcwd() function to get the current working directory. When you run the graphical Python
Shell, the current working directory starts as the directory where the Python Shell executable is. On
Windows, this depends on where you installed Python; the default directory is c:\Python31. If you run the
Python Shell from the command line, the current working directory starts as the directory you were in
when you ran python3.
3. Use the os.chdir() function to change the current working directory.
4. When I called the os.chdir() function, I used a Linux-style pathname (forward slashes, no drive letter) even
though I’m on Windows. This is one of the places where Python tries to paper over the differences between
operating systems.
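Changing the current working directory affects the whole process, so in a script it is polite to change it back. The sketch below does that with a temporary directory standing in for the examples folder; the scratch directory and the variable names are assumptions for illustration.

```python
import os
import tempfile

original_directory = os.getcwd()

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)
    try:
        # realpath() normalizes symlinked temp locations so the comparison is fair.
        moved = os.path.realpath(os.getcwd()) == os.path.realpath(scratch)
    finally:
        # Always step back out, so the temporary directory can be deleted.
        os.chdir(original_directory)

restored = os.getcwd() == original_directory

print(moved)
print(restored)
```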
3.2.2. WORKING WITH FILENAMES AND DIRECTORY NAMES
While we’re on the subject of directories, I want to point out the os.path module. os.path contains
functions for manipulating filenames and directory names.
>>> import os
>>> print(os.path.join('/Users/pilgrim/diveintopython3/examples/', 'humansize.py'))
①
/Users/pilgrim/diveintopython3/examples/humansize.py
>>> print(os.path.join('/Users/pilgrim/diveintopython3/examples', 'humansize.py'))
②
/Users/pilgrim/diveintopython3/examples\humansize.py
>>> print(os.path.expanduser('~'))
③
c:\Users\pilgrim
>>> print(os.path.join(os.path.expanduser('~'), 'diveintopython3', 'examples', 'humansize.py'))
④
c:\Users\pilgrim\diveintopython3\examples\humansize.py
1. The os.path.join() function constructs a pathname out of one or more partial pathnames. In this case, it
simply concatenates strings.
2. In this slightly less trivial case, calling the os.path.join() function will add an extra slash to the pathname
before joining it to the filename. It’s a backslash instead of a forward slash, because I constructed this
example on Windows. If you replicate this example on Linux or Mac OS X, you’ll see a forward slash
instead. Don’t fuss with slashes; always use os.path.join() and let Python do the right thing.
3. The os.path.expanduser() function will expand a pathname that uses ~ to represent the current user’s
home directory. This works on any platform where users have a home directory, including Linux, Mac OS X,
and Windows. The returned path does not have a trailing slash, but the os.path.join() function doesn’t
mind.
4. Combining these techniques, you can easily construct pathnames for directories and files in the user’s home
directory. The os.path.join() function can take any number of arguments. I was overjoyed when I
discovered this, since addSlashIfNecessary() is one of the stupid little functions I always need to write
when building up my toolbox in a new language. Do not write this stupid little function in Python; smart
people have already taken care of it for you.
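You can see both flavors of slash on a single machine. os.path is really one of two standard library modules chosen for your platform: posixpath on Linux and Mac OS X, ntpath on Windows. Importing them by name, as this sketch does, is only a way to make the output the same everywhere; in real code you would keep using os.path.

```python
import ntpath      # the Windows flavor of os.path
import posixpath   # the Linux / Mac OS X flavor of os.path

folder = '/Users/pilgrim/diveintopython3/examples'

posix_result = posixpath.join(folder, 'humansize.py')
windows_result = ntpath.join(folder, 'humansize.py')

# A trailing separator on the first part is not doubled.
no_double_slash = posixpath.join(folder + '/', 'humansize.py')

# join() accepts any number of parts.
deep = posixpath.join('home', 'pilgrim', 'diveintopython3', 'examples')

print(posix_result)
print(windows_result)
print(no_double_slash)
print(deep)
```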
os.path also contains functions to split full pathnames, directory names, and filenames into their constituent
parts.
>>> pathname = '/Users/pilgrim/diveintopython3/examples/humansize.py'
>>> os.path.split(pathname)
①
('/Users/pilgrim/diveintopython3/examples', 'humansize.py')
>>> (dirname, filename) = os.path.split(pathname)
②
>>> dirname
③
'/Users/pilgrim/diveintopython3/examples'
>>> filename
④
'humansize.py'
>>> (shortname, extension) = os.path.splitext(filename)
⑤
>>> shortname
'humansize'
>>> extension
'.py'
1. The split function splits a full pathname and returns a tuple containing the path and filename.
2. Remember when I said you could use multi-variable assignment to return multiple values from a function?
The os.path.split() function does exactly that. You assign the return value of the split function into a
tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.
3. The first variable, dirname, receives the value of the first element of the tuple returned from the
os.path.split() function, the file path.
4. The second variable, filename, receives the value of the second element of the tuple returned from the
os.path.split() function, the filename.
5. os.path also contains the os.path.splitext() function, which splits a filename and returns a tuple
containing the filename and the file extension. You use the same technique to assign each of them to
separate variables.
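The same trick works for your own functions. The split_pathname() helper below is hypothetical; it returns three values as one tuple, and the caller unpacks them with multi-variable assignment. I use posixpath here instead of os.path so the result does not depend on the platform running the sketch.

```python
import posixpath  # fixed forward-slash rules, so the result is the same on every platform

def split_pathname(pathname):
    """Hypothetical helper: return (directory, short name, extension) as one tuple."""
    dirname, filename = posixpath.split(pathname)
    shortname, extension = posixpath.splitext(filename)
    return dirname, shortname, extension

# The caller can keep the tuple whole...
parts = split_pathname('/Users/pilgrim/diveintopython3/examples/humansize.py')

# ...or unpack it into separate names.
dirname, shortname, extension = parts

print(parts)
print(dirname)
print(shortname, extension)
```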
3.2.3. LISTING DIRECTORIES
The glob module is another tool in the Python standard library. It’s an easy way to get the contents of a
directory programmatically, and it uses the sort of wildcards that you may already be familiar with from
working on the command line.
>>> os.chdir('/Users/pilgrim/diveintopython3/')
>>> import glob
>>> glob.glob('examples/*.xml')
①
['examples\\feed-broken.xml',
'examples\\feed-ns0.xml',
'examples\\feed.xml']
>>> os.chdir('examples/')
②
>>> glob.glob('*test*.py')
③
['alphameticstest.py',
'pluraltest1.py',
'pluraltest2.py',
'pluraltest3.py',
'pluraltest4.py',
'pluraltest5.py',
'pluraltest6.py',
'romantest1.py',
'romantest10.py',
'romantest2.py',
'romantest3.py',
'romantest4.py',
'romantest5.py',
'romantest6.py',
'romantest7.py',
'romantest8.py',
'romantest9.py']
1. The glob module takes a wildcard and returns the path of all files and directories matching the wildcard. In
this example, the wildcard is a directory path plus “*.xml”, which will match all .xml files in the examples
subdirectory.
2. Now change the current working directory to the examples subdirectory. The os.chdir() function can
take relative pathnames.
3. You can include multiple wildcards in your glob pattern. This example finds all the files in the current
working directory that end in a .py extension and contain the word test anywhere in their filename.
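If you don't have the examples folder handy, you can try the same wildcards against a throwaway directory. Everything in this sketch is made up for the occasion: the scratch directory, the four placeholder files, and the variable names.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Create a few empty-ish placeholder files to match against.
    for name in ('feed.xml', 'feed-broken.xml', 'romantest1.py', 'humansize.py'):
        with open(os.path.join(scratch, name), 'w') as handle:
            handle.write('placeholder')

    # glob.glob() makes no promise about order, so sort before relying on it.
    xml_files = sorted(os.path.basename(path)
                       for path in glob.glob(os.path.join(scratch, '*.xml')))
    test_files = sorted(os.path.basename(path)
                        for path in glob.glob(os.path.join(scratch, '*test*.py')))

print(xml_files)
print(test_files)
```

Building the pattern with os.path.join() means the sketch never has to change the current working directory.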
3.2.4. GETTING FILE METADATA
Every modern file system stores metadata about each file: creation date, last-modified date, file size, and so
on. Python provides a single API to access this metadata. You don’t need to open the file; all you need is
the filename.
>>> import os
>>> print(os.getcwd())
①
c:\Users\pilgrim\diveintopython3\examples
>>> metadata = os.stat('feed.xml')
②
>>> metadata.st_mtime
③
1247520344.9537716
>>> import time
④
>>> time.localtime(metadata.st_mtime)
⑤
time.struct_time(tm_year=2009, tm_mon=7, tm_mday=13, tm_hour=17,
tm_min=25, tm_sec=44, tm_wday=0, tm_yday=194, tm_isdst=1)
1. The current working directory is the examples folder.
2. feed.xml is a file in the examples folder. Calling the os.stat() function returns an object that contains
several different types of metadata about the file.
3. st_mtime is the modification time, but it’s in a format that isn’t terribly useful. (Technically, it’s the number
of seconds since the Epoch, which is defined as the first second of January 1st, 1970. Seriously.)
4. The time module is part of the Python standard library. It contains functions to convert between different
time representations, format time values into strings, and fiddle with timezones.
5. The time.localtime() function converts a time value from seconds-since-the-Epoch (from the st_mtime
property returned from the os.stat() function) into a more useful structure of year, month, day, hour,
minute, second, and so on. This file was last modified on July 13, 2009, at around 5:25 PM.
# continued from the previous example
>>> metadata.st_size
①
3070
>>> import humansize
>>> humansize.approximate_size(metadata.st_size)
②
'3.0 KiB'
1. The os.stat() function also returns the size of a file, in the st_size property. The file feed.xml is 3070
bytes.
2. You can pass the st_size property to the approximate_size() function.
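Here is a self-contained sketch of the same metadata calls. It writes its own 3070-byte stand-in for feed.xml into a temporary directory, so the size is known in advance; the modification time will be whenever you run it. The time.strftime() format string is my choice, not something from the example above.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'feed.xml')   # stand-in file, not the real feed.xml
    with open(path, 'wb') as handle:
        handle.write(b'x' * 3070)

    metadata = os.stat(path)                   # no need to open the file again
    size_in_bytes = metadata.st_size
    modified = time.localtime(metadata.st_mtime)
    modified_text = time.strftime('%Y-%m-%d %H:%M', modified)

print(size_in_bytes)
print(modified_text)
```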
3.2.5. CONSTRUCTING ABSOLUTE PATHNAMES
In the previous section, the glob.glob() function returned a list of relative pathnames. The first example had pathnames like 'examples\feed.xml', and the second example had even shorter relative pathnames like
'romantest1.py'. As long as you stay in the same current working directory, these relative pathnames will
work for opening files or getting file metadata. But if you want to construct an absolute pathname — i.e. one
that includes all the directory names back to the root directory or drive letter — then you’ll need the
os.path.realpath() function.
>>> import os
>>> print(os.getcwd())
c:\Users\pilgrim\diveintopython3\examples
>>> print(os.path.realpath('feed.xml'))
c:\Users\pilgrim\diveintopython3\examples\feed.xml
⁂
3.3. LIST COMPREHENSIONS
A list comprehension provides a compact way of mapping a list into another list by applying a function to
each of the elements of the list.
>>> a_list = [1, 9, 8, 4]
>>> [elem * 2 for elem in a_list]
①
[2, 18, 16, 8]
>>> a_list
②
[1, 9, 8, 4]
>>> a_list = [elem * 2 for elem in a_list]
③
>>> a_list
[2, 18, 16, 8]
1. To make sense of this, look at it from right to left. a_list is the list you’re mapping. The Python
interpreter loops through a_list one element at a time, temporarily assigning the value of each element to
the variable elem. Python then applies the function elem * 2 and appends that result to the returned list.
2. A list comprehension creates a new list; it does not change the original list.
3. It is safe to assign the result of a list comprehension to the variable that you’re mapping. Python constructs
the new list in memory, and when the list comprehension is complete, it assigns the result to the original
variable.
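If the right-to-left reading feels unnatural, it may help to see the comprehension next to the for loop it replaces. This is only a sketch of the equivalence; doubled and doubled_by_loop are illustrative names.

```python
a_list = [1, 9, 8, 4]

# The comprehension...
doubled = [elem * 2 for elem in a_list]

# ...builds the same list as this explicit loop.
doubled_by_loop = []
for elem in a_list:
    doubled_by_loop.append(elem * 2)

print(doubled)
print(doubled_by_loop)
print(a_list)   # the original list is untouched either way
```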
You can use any Python expression in a list comprehension, including the functions in the os module for
manipulating files and directories.
>>> import os, glob
>>> glob.glob('*.xml')
①
['feed-broken.xml', 'feed-ns0.xml', 'feed.xml']
>>> [os.path.realpath(f) for f in glob.glob('*.xml')]
②
['c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-broken.xml',
'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-ns0.xml',
'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed.xml']
1. This returns a list of all the .xml files in the current working directory.
2. This list comprehension takes that list of .xml files and transforms it into a list of full pathnames.
List comprehensions can also filter items, producing a result that can be smaller than the original list.
>>> import os, glob
>>> [f for f in glob.glob('*.py') if os.stat(f).st_size > 6000]
①
['pluraltest6.py',
'romantest10.py',
'romantest6.py',
'romantest7.py',
'romantest8.py',
'romantest9.py']
1. To filter a list, you can include an if clause at the end of the list comprehension. The expression after the
if keyword will be evaluated for each item in the list. If the expression evaluates to True, the item will be
included in the output. This list comprehension looks at the list of all .py files in the current directory, and
the if expression filters that list by testing whether the size of each file is greater than 6000 bytes. There
are six such files, so the list comprehension returns a list of six filenames.
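The filter works on any data, not just files. In the sketch below, a dictionary of made-up file sizes stands in for the calls to os.stat(), so you can run it anywhere; the filenames and sizes are invented for illustration.

```python
# Made-up sizes in bytes, standing in for os.stat(f).st_size.
sizes = {'humansize.py': 2100,
         'pluraltest6.py': 6400,
         'romantest9.py': 9800,
         'stdout.py': 900}

# Keep only the names whose size passes the test.
large = [name for name in sorted(sizes) if sizes[name] > 6000]

# The same filter, written as an explicit loop.
large_by_loop = []
for name in sorted(sizes):
    if sizes[name] > 6000:
        large_by_loop.append(name)

print(large)
```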
All the examples of list comprehensions so far have featured simple expressions — multiply a number by a
constant, call a single function, or simply return the original list item (after filtering). But there’s no limit to
how complex a list comprehension can be.
>>> import os, glob
>>> [(os.stat(f).st_size, os.path.realpath(f)) for f in glob.glob('*.xml')]
①
[(3074, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-broken.xml'),
(3386, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-ns0.xml'),
(3070, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed.xml')]
>>> import humansize
>>> [(humansize.approximate_size(os.stat(f).st_size), f) for f in glob.glob('*.xml')]
②
[('3.0 KiB', 'feed-broken.xml'),
('3.3 KiB', 'feed-ns0.xml'),
('3.0 KiB', 'feed.xml')]
1. This list comprehension finds all the .xml files in the current working directory, gets the size of each file (by
calling the os.stat() function), and constructs a tuple of the file size and the absolute path of each file (by
calling the os.path.realpath() function).
2. This comprehension builds on the previous one to call the approximate_size() function with the file size of each .xml file.
⁂
3.4. DICTIONARY COMPREHENSIONS
A dictionary comprehension is like a list comprehension, but it constructs a dictionary instead of a list.
>>> import os, glob
>>> metadata = [(f, os.stat(f)) for f in glob.glob('*test*.py')]
①
>>> metadata[0]
②
('alphameticstest.py', nt.stat_result(st_mode=33206, st_ino=0, st_dev=0,
st_nlink=0, st_uid=0, st_gid=0, st_size=2509, st_atime=1247520344,
st_mtime=1247520344, st_ctime=1247520344))
>>> metadata_dict = {f:os.stat(f) for f in glob.glob('*test*.py')}
③
>>> type(metadata_dict)
④
<class 'dict'>
>>> list(metadata_dict.keys())
⑤
['romantest8.py', 'pluraltest1.py', 'pluraltest2.py', 'pluraltest5.py',
'pluraltest6.py', 'romantest7.py', 'romantest10.py', 'romantest4.py',
'romantest9.py', 'pluraltest3.py', 'romantest1.py', 'romantest2.py',
'romantest3.py', 'romantest5.py', 'romantest6.py', 'alphameticstest.py',
'pluraltest4.py']
>>> metadata_dict['alphameticstest.py'].st_size
⑥
2509
1. This is not a dictionary comprehension; it’s a list comprehension. It finds all .py files with test in their name, then constructs a tuple of the filename and the file metadata (from calling the os.stat() function).
2. Each item of the resulting list is a tuple.
3. This is a dictionary comprehension. The syntax is similar to a list comprehension, with two differences. First,
it is enclosed in curly braces instead of square brackets. Second, instead of a single expression for each item,
it contains two expressions separated by a colon. The expression before the colon (f in this example) is the
dictionary key; the expression after the colon (os.stat(f) in this example) is the value.
4. A dictionary comprehension returns a dictionary.
5. The keys of this particular dictionary are simply the filenames returned from the call to
glob.glob('*test*.py').
6. The value associated with each key is the return value from the os.stat() function. That means we can
“look up” a file by name in this dictionary to get its file metadata. One of the pieces of metadata is st_size,
the file size. The file alphameticstest.py is 2509 bytes long.
Like list comprehensions, you can include an if clause in a dictionary comprehension to filter the input
sequence based on an expression which is evaluated with each item.
>>> import os, glob, humansize
>>> metadata_dict = {f:os.stat(f) for f in glob.glob('*')}
①
>>> humansize_dict = {os.path.splitext(f)[0]:humansize.approximate_size(meta.st_size) \
...                   for f, meta in metadata_dict.items() if meta.st_size > 6000}
②
>>> list(humansize_dict.keys())
③
['romantest9', 'romantest8', 'romantest7', 'romantest6', 'romantest10', 'pluraltest6']
>>> humansize_dict['romantest9']
④
'6.5 KiB'
1. This dictionary comprehension constructs a list of all the files in the current working directory
(glob.glob('*')), gets the file metadata for each file (os.stat(f)), and constructs a dictionary whose keys
are filenames and whose values are the metadata for each file.
2. This dictionary comprehension builds on the previous comprehension, filters out files smaller than 6000 bytes
(if meta.st_size > 6000), and uses that filtered list to construct a dictionary whose keys are the filename
minus the extension (os.path.splitext(f)[0]) and whose values are the approximate size of each file
(humansize.approximate_size(meta.st_size)).
3. As you saw in a previous example, there are six such files, thus there are six items in this dictionary.
4. The value of each key is the string returned from the approximate_size() function.
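If you don't have a directory full of test files handy, the same pattern works on any dictionary. This is a small sketch with invented file names and sizes standing in for real os.stat() results; only the shape of the comprehension matters.

```python
import os

# Made-up file sizes in bytes; stand-ins for real os.stat() results.
sizes = {'romantest9.py': 6654, 'pluraltest1.py': 1200, 'notes.txt': 9000}

# key expression : value expression, for each (name, size) pair,
# keeping only the pairs that pass the if clause.
big_py = {os.path.splitext(name)[0]: size
          for name, size in sizes.items()
          if size > 6000 and name.endswith('.py')}

print(big_py)
```

The if clause runs before the key and value expressions are evaluated, so filtered-out items never reach os.path.splitext().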
3.4.1. OTHER FUN STUFF TO DO WITH DICTIONARY COMPREHENSIONS
Here’s a trick with dictionary comprehensions that might be useful someday: swapping the keys and values of
a dictionary.
>>> a_dict = {'a': 1, 'b': 2, 'c': 3}
>>> {value:key for key, value in a_dict.items()}
{1: 'a', 2: 'b', 3: 'c'}
Of course, this only works if the values of the dictionary are immutable (more precisely, hashable), like
strings or tuples, because they have to serve as keys in the new dictionary. If you try this with a
dictionary that contains lists, it will fail most spectacularly.
>>> a_dict = {'a': [1, 2, 3], 'b': 4, 'c': 5}
>>> {value:key for key, value in a_dict.items()}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
TypeError: unhashable type: 'list'
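One way around the failure, sketched here under the assumption that turning lists into tuples is acceptable for your data, is to convert each value into something hashable before using it as a key. The freeze() helper is hypothetical; it exists only for this illustration. The sketch also shows a second catch: if two keys share a value, only one of them survives the swap.

```python
a_dict = {'a': [1, 2, 3], 'b': 4, 'c': 5}

def freeze(value):
    """Hypothetical helper: return a hashable stand-in (lists become tuples)."""
    return tuple(value) if isinstance(value, list) else value

# The list [1, 2, 3] becomes the tuple (1, 2, 3), which can be a key.
inverted = {freeze(value): key for key, value in a_dict.items()}
print(inverted)

# Duplicate values collapse into one key; the last pair seen wins.
collapsed = {value: key for key, value in {'x': 1, 'y': 1}.items()}
print(collapsed)
```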
⁂
3.5. SET COMPREHENSIONS
Not to be left out, sets have their own comprehension syntax as well. It is remarkably similar to the syntax
for dictionary comprehensions. The only difference is that sets just have values instead of key:value pairs.
>>> a_set = set(range(10))
>>> a_set
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> {x ** 2 for x in a_set}
①
{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}
>>> {x for x in a_set if x % 2 == 0}
②
{0, 8, 2, 4, 6}
>>> {2**x for x in range(10)}
③
{32, 1, 2, 4, 8, 64, 128, 256, 16, 512}
1. Set comprehensions can take a set as input. This set comprehension calculates the squares of the set of
numbers from 0 to 9.
2. Like list comprehensions and dictionary comprehensions, set comprehensions can contain an if clause to
filter each item before returning it in the result set.
3. Set comprehensions do not need to take a set as input; they can take any sequence.
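Because a set throws away duplicates, a set comprehension is a handy way to ask "which distinct values occur?" The word list below is made up for illustration.

```python
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# The input is a list; the output is a set, so repeats vanish.
initials = {word[0] for word in words}
lengths = {len(word) for word in words}

# Sets are unordered, so sort before printing for a stable display.
print(sorted(initials))
print(sorted(lengths))
```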
⁂
3.6. FURTHER READING
• os — Portable access to operating system specific features
• os.path — Platform-independent manipulation of file names
• glob — Filename pattern matching
• time — Functions for manipulating clock time
CHAPTER 4. STRINGS
❝ I’m telling you this ’cause you’re one of my friends.
My alphabet starts where your alphabet ends! ❞
— Dr. Seuss, On Beyond Zebra!
4.1. SOME BORING STUFF YOU NEED TO UNDERSTAND BEFORE YOU CAN DIVE IN
Few people think about it, but text is incredibly complicated. Start with the alphabet. The people of
Bougainville have the smallest alphabet in the world; their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese,
and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and
lowercase separately — plus a handful of !@#$%& punctuation marks.
When you talk about “text,” you’re probably thinking of “characters and symbols on my computer screen.”
But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve
ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking,
the character encoding provides a mapping between the stuff you see on your screen and the stuff your
computer actually stores in memory and on disk. There are many different character encodings, some
optimized for particular languages like Russian or Chinese or English, and others that can be used for
multiple languages.
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each
encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So
you can think of the character encoding as a kind of decryption key. Whenever someone gives you a
sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character
encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key
at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and
the result will be gibberish.
Surely you’ve seen web pages with strange question-mark-like characters where apostrophes should be. That
usually means the page author didn’t declare their character encoding correctly, your browser was left
guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in
other languages, the result can be completely unreadable.
There are character encodings for each major language in the world. Since each language is different, and
memory and disk space have historically been expensive, each character encoding is optimized for a particular
language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s
characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as
numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, &c.) English has a very simple
alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in
base 2, that’s 7 out of the 8 bits in a byte.
Western European languages like French, Spanish, and German have more letters than English. Or, more
precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most
common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on
Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then
extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252),
&c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
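You can check these numbers yourself. This is a minimal sketch using Python's built-in ord() and chr() functions and the encode() string method; the codec names 'ascii' and 'cp1252' are the names Python uses for these encodings.

```python
# Characters and their numbers in ASCII.
capital_a = ord('A')   # 65
small_a = chr(97)      # 'a'

# In CP-1252, each of these characters is stored as a single byte.
n_tilde = list('ñ'.encode('cp1252'))    # [241]
u_umlaut = list('ü'.encode('cp1252'))   # [252]

# ASCII stops at 127, so it has no way to store ñ at all.
try:
    'ñ'.encode('ascii')
except UnicodeEncodeError:
    fits_in_ascii = False
else:
    fits_in_ascii = True

print(capital_a, small_a, n_tilde, u_umlaut, fits_in_ascii)
```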
Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they
require multiple-byte character sets. That is, each “character” is represented by a two-byte number from
0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings,
namely that they each use the same numbers to mean different things. It’s just that the range of numbers is
broader, because there are many more characters to represent.
That was mostly OK in a non-networked world, where “text” was something you typed yourself and
occasionally printed. There wasn’t much “plain text”. Source code was ASCII, and everyone else used word
processors, which defined their own (non-text) formats that tracked character encoding information along
with rich styling, &c. People read these documents with the same word processing program as the original
author, so everything worked, more or less.
Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the
globe, being authored on one computer, transmitted through a second computer, and received and displayed
by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no!
What to do? Well, systems had to be designed to carry encoding information along with every piece of
“plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable
characters. A missing decryption key means garbled text, gibberish, or worse.
Now think about trying to store multiple pieces of text in the same place, like in the same database table
that holds all the email you’ve ever received. You still need to store the character encoding alongside each
piece of text so you can display it properly. Think that’s hard? Try searching your email database, which
means converting between multiple encodings on the fly. Doesn’t that sound fun?
Now think about the possibility of multilingual documents, where characters from several languages are next
to each other in the same document. (Hint: programs that tried to do this typically used escape codes to
switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means Я; poof, now you’re in Mac Greek
mode, so 241 means ώ.) And of course you’ll want to search those documents, too.
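Here is what "the same number means different things" looks like in practice. This sketch decodes the single byte 241 with three of the legacy encodings mentioned above, using the codec names Python ships with ('cp1252', 'koi8_r', and 'mac_greek').

```python
raw = bytes([241])  # one byte, value 241 (hexadecimal F1)

# Same byte, three different "decryption keys", three different characters.
decoded = {codec: raw.decode(codec)
           for codec in ('cp1252', 'koi8_r', 'mac_greek')}

for codec, character in decoded.items():
    print(codec, character)
```

None of these answers is wrong. Without knowing which encoding the author used, there is no way to tell which character was intended.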
Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such
thing as “plain text.”
⁂
4.2. UNICODE
Enter Unicode.
Unicode is a system designed to represent every character from every language. Unicode represents each
letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at
least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2
bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number,
unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and
exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep
track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.
On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per
document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious
question should leap out at you. Four bytes? For every single character‽ That seems awfully wasteful,
especially for languages like English and Spanish, which need less than one byte (256 numbers) to express
every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never
need more than two bytes per character.
There is a Unicode encoding that uses four bytes per character. It’s called UTF-32, because 32 bits = 4
bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and
represents the character with that same number. This has some advantages, the most important being that
you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth
byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every
freaking character.
Even though there are a lot of Unicode characters, it turns out that most people will never use anything
beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes).
UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need
to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage:
UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store
instead of four bytes (except for the ones that don’t). And you can still easily find the Nth character of a
string in constant time, if you assume that the string doesn’t include any astral plane characters, which is a
good assumption right up until the moment that it’s not.
But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store
individual bytes in different ways. That means that the character U+4E2D could be stored in UTF-16 as either
4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even
more possible byte orderings.) As long as your documents never leave your computer, you’re
safe — different applications on the same computer will all use the same byte order. But the minute you
want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to
need a way to indicate which order your bytes are stored. Otherwise, the receiving system has no way of
knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E.
To solve this problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special
non-printable character that you can include at the beginning of your document to indicate what order your
bytes are in. For UTF-16, the Byte Order Mark is U+FEFF. If you receive a UTF-16 document that starts
with the bytes FF FE, you know the byte ordering is little-endian; if it starts with FE FF, you know the byte
ordering is big-endian.
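A short sketch makes the byte-order problem concrete. It encodes the character U+4E2D both ways, then shows that Python's generic 'utf-16' codec reads the Byte Order Mark to work out which way round the bytes are. The BOM constants come from the standard codecs module.

```python
import codecs

ch = '\u4e2d'  # the character 中

big = ch.encode('utf-16-be')      # bytes 4E 2D
little = ch.encode('utf-16-le')   # bytes 2D 4E
print(big.hex(), little.hex())

# The Byte Order Mark U+FEFF, as it appears in each byte order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# With a BOM in front, the plain 'utf-16' codec picks the right order
# and strips the BOM from the result.
from_big = (codecs.BOM_UTF16_BE + big).decode('utf-16')
from_little = (codecs.BOM_UTF16_LE + little).decode('utf-16')
```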
Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of ASCII characters. If you think about
it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes
surrounding the printable Chinese characters. Being able to find the Nth character in constant time is nice,
but there’s still the nagging problem of those astral plane characters, which mean that you can’t guarantee
that every character is exactly two bytes, so you can’t really find the Nth character in constant time unless
you maintain a separate index. And boy, there sure is a lot of ASCII text in the world…
Other people pondered these questions, and they came up with a solution:
UTF-8
UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different
number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses
the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended
Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point
like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end
up taking three bytes. The rarely-used “astral plane” characters take four bytes.
Disadvantages: because each character can take a different number of bytes, finding the Nth character is an
O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is
bit-twiddling involved to encode characters into bytes and decode bytes into characters.
Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin
characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because
I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-
ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
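To compare the three encodings side by side, this sketch counts how many bytes each one needs for four sample characters. The samples are my own picks: an ASCII letter, an extended Latin letter, a Chinese character, and one "astral plane" character (U+1F600). The explicit little-endian codec names are used so that no Byte Order Mark is added to the count.

```python
samples = ['A', 'ñ', '中', '\U0001F600']
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

# For each character: (bytes in UTF-8, bytes in UTF-16, bytes in UTF-32)
byte_counts = {ch: tuple(len(ch.encode(enc)) for enc in encodings)
               for ch in samples}

for ch, counts in byte_counts.items():
    print('U+{0:04X}'.format(ord(ch)), counts)

# For ASCII characters, the UTF-8 bytes are exactly the ASCII bytes.
same_as_ascii = 'A'.encode('utf-8') == 'A'.encode('ascii')
```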
⁂
4.3. DIVING IN
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string
encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question.
UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a
sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a
sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters;
bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
>>> s = '深入 Python'
①
>>> len(s)
②
9
>>> s[0]
③
'深'
>>> s + ' 3'
④
'深入 Python 3'
1. To create a string, enclose it in quotes. Python strings can be defined with either single quotes (') or double
quotes (").
2. The built-in len() function returns the length of the string, i.e. the number of characters. This is the same
function you use to find the length of a list, tuple, set, or dictionary. A string is like a tuple of characters.
3. Just like getting individual items out of a list, you can get individual characters out of a string using index
notation.
4. Just like lists, you can concatenate strings using the + operator.
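To see that len() really counts characters and not bytes, here is a quick sketch that measures the same string before and after encoding. The choice of UTF-8 and UTF-16 is just for comparison.

```python
s = '深入 Python'

characters = len(s)                         # 2 Chinese + 1 space + 6 Latin
utf8_bytes = len(s.encode('utf-8'))         # 3 + 3 + 1 + 6
utf16_bytes = len(s.encode('utf-16-le'))    # 2 bytes for each of the 9

print(characters, utf8_bytes, utf16_bytes)
```

The string itself has one length. The byte count only exists once you pick an encoding.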
⁂
4.4. FORMATTING STRINGS
Let’s take another look at humansize.py:
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
①
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    '''Convert a file size to human-readable form.
②
    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000
    Returns: string
    '''
③
    if size < 0:
        raise ValueError('number must be non-negative')
④
    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return '{0:.1f} {1}'.format(size, suffix)
⑤
    raise ValueError('number too large')
1. 'KB', 'MB', 'GB'… those are each strings.
2. Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start
and end the string.
3. These three-in-a-row quotes end the docstring.
4. There’s another string, being passed to the exception as a human-readable error message.
5. There’s a… whoa, what the heck is that?
Python 3 supports formatting values into strings. Although this can include very complicated expressions, the
most basic usage is to insert a value into a string with a single placeholder.
>>> username = 'mark'
>>> password = 'PapayaWhip'
①
>>> "{0}'s password is {1}".format(username, password)
②
"mark's password is PapayaWhip"
1. No, my password is not really PapayaWhip.
2. There’s a lot going on here. First, that’s a method call on a string literal. Strings are objects, and objects have
methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are replacement fields, which
are replaced by the arguments passed to the format() method.
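Here is a short sketch of a few things positional replacement fields allow. The sample strings are invented for illustration.

```python
username = 'mark'
password = 'PapayaWhip'

# A positional index can appear more than once, and in any order.
repeated = '{0} logs in as {0}; the password is {1}'.format(username, password)
swapped = '{1}, {0}'.format('first', 'second')
print(repeated)
print(swapped)

# Asking for an argument that was never passed raises IndexError.
try:
    '{2}'.format(username, password)
except IndexError:
    out_of_range = True
else:
    out_of_range = False
```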
4.4.1. COMPOUND FIELD NAMES
The previous example shows the simplest case, where the replacement fields are simply integers. Integer
replacement fields are treated as positional indices into the argument list of the format() method. That
means that {0} is replaced by the first argument (username in this case), {1} is replaced by the second
argument (password), &c. You can have as many positional indices as you have arguments, and you can have
as many arguments as you want. But replacement fields are much more powerful than that.
>>> import humansize
>>> si_suffixes = humansize.SUFFIXES[1000]
①
>>> si_suffixes
['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
>>> '1000{0[0]} = 1{0[1]}'.format(si_suffixes)
②
'1000KB = 1MB'
1. Rather than calling any function in the humansize module, you’re just grabbing one of the data structures it
defines: the list of “SI” (powers-of-1000) suffixes.
2. This looks complicated, but it’s not. {0} would refer to the first argument passed to the format() method,
si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first
argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same
list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is
untouched. The final result is the string '1000KB = 1MB'.
What this example shows is that replacement fields can access items and properties of data structures using
(almost) Python syntax. This is called compound field names. The following compound field names “just work”:
• Passing a list, and accessing an item of the list by index (as in the previous example)
• Passing a dictionary, and accessing a value of the dictionary by key
• Passing a module, and accessing its variables and functions by name
• Passing a class instance, and accessing its properties and methods by name
• Any combination of the above
Just to blow your mind, here’s an example that combines all of the above:
>>> import humansize
>>> import sys
>>> '1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}'.format(sys)
'1MB = 1000KB'
Here’s how it works:
• The sys module holds information about the currently running Python instance. Since you just imported it,
you can pass the sys module itself as an argument to the format() method. So the replacement field {0}
refers to the sys module.
• sys.modules is a dictionary of all the modules that have been imported in this Python instance. The keys
are the module names as strings; the values are the module objects themselves. So the replacement field
{0.modules} refers to the dictionary of imported modules.
• sys.modules['humansize'] is the humansize module which you just imported. The replacement field
{0.modules[humansize]} refers to the humansize module. Note the slight difference in syntax here. In real
Python code, the keys of the sys.modules dictionary are strings; to refer to them, you need to put quotes
around the module name (e.g. 'humansize'). But within a replacement field, you skip the quotes around the
dictionary key name (e.g. humansize). To quote PEP 3101: Advanced String Formatting, “The rules for
parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is
used as a string.”
• sys.modules['humansize'].SUFFIXES is the dictionary defined at the top of the humansize module. The
replacement field {0.modules[humansize].SUFFIXES} refers to that dictionary.
• sys.modules['humansize'].SUFFIXES[1000] is a list of SI suffixes: ['KB', 'MB', 'GB', 'TB', 'PB',
'EB', 'ZB', 'YB']. So the replacement field {0.modules[humansize].SUFFIXES[1000]} refers to that list.
• sys.modules['humansize'].SUFFIXES[1000][0] is the first item of the list of SI suffixes: 'KB'. Therefore,
the complete replacement field {0.modules[humansize].SUFFIXES[1000][0]} is replaced by the two-
character string KB.
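The same rules apply to your own data structures. This is a smaller sketch of the dictionary and class-instance cases; the Point class is hypothetical, defined here only so there is an object with attributes to look up.

```python
class Point:
    """Hypothetical class, defined only to demonstrate attribute access."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Dictionary keys: no quotes around the key inside the replacement field.
by_key = '{0[user]} uses {0[database]}'.format(
    {'user': 'pilgrim', 'database': 'master'})

# Attributes of a class instance: dotted names, just like Python.
by_attr = '({0.x}, {0.y})'.format(Point(3, 4))

# A combination: dictionary key, then list index, then attribute.
combined = 'first x is {0[points][0].x}'.format(
    {'points': [Point(3, 4), Point(5, 6)]})

print(by_key)
print(by_attr)
print(combined)
```

In {0[points][0].x}, the key points does not start with a digit, so it is used as a string; the 0 after it does, so it is used as a number.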
4.4.2. FORMAT SPECIFIERS
But wait! There’s more! Let’s take another look at that strange line of code from humansize.py:
if size < multiple:
    return '{0:.1f} {1}'.format(size, suffix)
{1} is replaced with the second argument passed to the format() method, which is suffix. But what is
{0:.1f}? It’s two things: {0}, which you recognize, and :.1f, which you don’t. The second half (including
and after the colon) defines the format specifier, which further refines how the replaced variable should be
formatted.
☞ Format specifiers allow you to munge the replacement text in a variety of useful
ways, like the printf() function in C. You can add zero- or space-padding, align
strings, control decimal precision, and even convert numbers to hexadecimal.
Within a replacement field, a colon (:) marks the start of the format specifier. The format specifier “.1”
means “round to the nearest tenth” (i.e. display only one digit after the decimal point). The format specifier
“f” means “fixed-point number” (as opposed to exponential notation or some other decimal representation).
Thus, given a size of 698.24 and suffix of 'GB', the formatted string would be '698.2 GB', because
698.24 gets rounded to one decimal place, then the suffix is appended after the number.
>>> '{0:.1f} {1}'.format(698.24, 'GB')
'698.2 GB'
For all the gory details on format specifiers, consult the Format Specification Mini-Language in the official Python documentation.
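Here is a sampler of a few other format specifiers, as a sketch of the padding, alignment, precision, and hexadecimal tricks mentioned above. The values are arbitrary.

```python
one_decimal = '{0:.1f}'.format(698.24)      # fixed-point, 1 digit after the point
padded = '{0:8.1f}'.format(698.24)          # same, right-aligned in 8 columns
zero_filled = '{0:05d}'.format(42)          # integer, zero-padded to 5 digits
hexadecimal = '{0:x}'.format(255)           # integer as lowercase hexadecimal
left = '{0:<6}|'.format('ab')               # string, left-aligned in 6 columns
centered = '{0:^6}|'.format('ab')           # string, centered in 6 columns
exponent = '{0:.2e}'.format(698.24)         # exponential notation, 2 digits

for result in (one_decimal, padded, zero_filled, hexadecimal,
               left, centered, exponent):
    print(repr(result))
```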
⁂
4.5. OTHER COMMON STRING METHODS
Besides formatting, strings can do a number of other useful tricks.
>>> s = '''Finished files are the re-
①
... sult of years of scientif-
... ic study combined with the
... experience of years.'''
>>> s.splitlines()
②
['Finished files are the re-',
'sult of years of scientif-',
'ic study combined with the',
'experience of years.']
>>> print(s.lower())
③
finished files are the re-
sult of years of scientif-
ic study combined with the
experience of years.
>>> s.lower().count('f')
④
6
1. You can input multiline strings in the Python interactive shell. Once you start a multiline string with triple
quotation marks, just hit ENTER and the interactive shell will prompt you to continue the string. Typing the
closing triple quotation marks ends the string, and the next ENTER will execute the command (in this case,
assigning the string to s).
2. The splitlines() method takes one multiline string and returns a list of strings, one for each line of the
original. Note that the carriage returns at the end of each line are not included.
3. The lower() method converts the entire string to lowercase. (Similarly, the upper() method converts a
string to uppercase.)
4. The count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that
sentence!
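A few details of these methods are easy to miss, so here is a small sketch with made-up strings: splitlines() copes with both Windows-style and Unix-style line endings, it can keep the line endings if you ask, and count() counts non-overlapping occurrences.

```python
text = 'one\r\ntwo\nthree'

# Both '\r\n' and '\n' count as line endings.
lines = text.splitlines()
# Pass keepends=True to keep the line endings in the result.
lines_with_endings = text.splitlines(keepends=True)

shouted = 'Hello'.upper()

# count() does not count overlapping matches.
an_count = 'banana'.count('an')
aa_count = 'aaaa'.count('aa')

print(lines, lines_with_endings, shouted, an_count, aa_count)
```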
Here’s another common case. Let’s say you have a list of key-value pairs in the form
key1=value1&key2=value2, and you want to split them up and make a dictionary of the form {key1:
value1, key2: value2}.
>>> query = 'user=pilgrim&database=master&password=PapayaWhip'
>>> a_list = query.split('&')
①
>>> a_list
['user=pilgrim', 'database=master', 'password=PapayaWhip']
>>> a_list_of_lists = [v.split('=', 1) for v in a_list if '=' in v]
②
>>> a_list_of_lists
[['user', 'pilgrim'], ['database', 'master'], ['password', 'PapayaWhip']]
>>> a_dict = dict(a_list_of_lists)
③
>>> a_dict
{'password': 'PapayaWhip', 'user': 'pilgrim', 'database': 'master'}
1. The split() string method has one required argument, a delimiter. The method splits a string into a list of
strings based on the delimiter. Here, the delimiter is an ampersand character, but it could be anything.
2. Now we have a list of strings, each with a key, followed by an equals sign, followed by a value. We can use
a list comprehension to iterate over the entire list and split each string into two strings based on the first equals sign. The optional second argument to the split() method is the number of times you want to split.
1 means “only split once,” so the split() method will return a two-item list. (In theory, a value could
contain an equals sign too. If you just used 'key=value=foo'.split('='), you would end up with a three-
item list ['key', 'value', 'foo'].)
3. Finally, Python can turn that list-of-lists into a dictionary simply by passing it to the dict() function.
☞ The previous example looks a lot like parsing query parameters in a URL, but real-life
URL parsing is actually more complicated than this. If you’re dealing with URL query
parameters, you’re better off using the urllib.parse.parse_qs() function, which handles some non-obvious edge cases.
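For comparison, here is a sketch of the same query string going through urllib.parse.parse_qs(), plus a second, made-up query string that shows a few of the edge cases the hand-rolled version would miss: repeated keys, percent-encoded characters, plus signs that stand for spaces, and blank values.

```python
from urllib.parse import parse_qs

query = 'user=pilgrim&database=master&password=PapayaWhip'
parsed = parse_qs(query)
print(parsed)

# Every value is a list, because a key may legally appear more than once.
# %20 and + are both decoded to a space.
tricky = parse_qs('q=a%20b&q=c+d')
print(tricky)

# By default, keys with blank values are dropped.
blanks = parse_qs('a=&b=1')
print(blanks)
```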
4.5.1. SLICING A STRING
Once you’ve defined a string, you can get any part of it as a new string. This is called slicing the string. Slicing
strings works exactly the same as slicing lists, which makes sense, because strings are just sequences of characters.
>>> a_string = 'My alphabet starts where your alphabet ends.'
>>> a_string[3:11]
①
'alphabet'
>>> a_string[3:-3]
②
'alphabet starts where your alphabet en'
>>> a_string[0:2]
③
'My'
>>> a_string[:18]
④
'My alphabet starts'
>>> a_string[18:]
⑤
' where your alphabet ends.'
1. You can get a part of a string, called a “slice”, by specifying two indices. The return value is a new string
containing all the characters of the string, in order, starting with the first slice index and going up to but
not including the second slice index.
2. Like slicing lists, you can use negative indices to slice strings.
3. Strings are zero-based, so a_string[0:2] returns the first two items of the string, starting at a_string[0],
up to but not including a_string[2].
4. If the left slice index is 0, you can leave it out, and 0 is implied. So a_string[:18] is the same as
a_string[0:18], because the starting 0 is implied.
5. Similarly, if the right slice index is the length of the string, you can leave it out. So a_string[18:] is the
same as a_string[18:44], because this string has 44 characters. There is a pleasing symmetry here. In this
44-character string, a_string[:18] returns the first 18 characters, and a_string[18:] returns everything
but the first 18 characters. In fact, a_string[:n] will always return the first n characters, and a_string[n:]
will return the rest, regardless of the length of the string.
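That symmetry is easy to check. This sketch splits the string at every possible position, including a few past the end, and confirms that the two halves always add up to the original.

```python
a_string = 'My alphabet starts where your alphabet ends.'

# Split at every index, even ones beyond the end of the string.
# Out-of-range slice indices are quietly clamped, not errors.
always_whole = all(a_string[:n] + a_string[n:] == a_string
                   for n in range(len(a_string) + 5))

tail = a_string[-5:]        # negative index: the last five characters
past_end = a_string[100:]   # nothing left to return

print(always_whole, repr(tail), repr(past_end))
```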
⁂
4.6. STRINGS VS. BYTES
Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a
string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.
>>> by = b'abcd\x65'
①
>>> by
b'abcde'
>>> type(by)
②
<class 'bytes'>
>>> len(by)
③
5
>>> by += b'\xff'
④
>>> by
b'abcde\xff'
>>> len(by)
⑤
6
>>> by[0]
⑥
97
>>> by[0] = 102
⑦
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment
1. To define a bytes object, use the b'' “byte literal” syntax. Each byte within the byte literal can be an ASCII
character or an encoded hexadecimal number from \x00 to \xff (0–255).
2. The type of a bytes object is bytes.
3. Just like lists and strings, you can get the length of a bytes object with the built-in len() function.
4. Just like lists and strings, you can use the + operator to concatenate bytes objects. The result is a new
bytes object.
5. Concatenating a 5-byte bytes object and a 1-byte bytes object gives you a 6-byte bytes object.
6. Just like lists and strings, you can use index notation to get individual bytes in a bytes object. The items of a
string are strings; the items of a bytes object are integers. Specifically, integers between 0–255.
7. A bytes object is immutable; you cannot assign individual bytes. If you need to change individual bytes, you
can either use slicing and concatenation (which work the same as they do on strings), or you can convert the bytes object into a bytearray object.
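The first option, slicing and concatenation, might look like the sketch below. It builds a new bytes object rather than changing the old one. Note the difference between indexing, which gives you an integer, and slicing, which gives you a bytes object.

```python
by = b'abcde'

first_as_int = by[0]       # indexing returns an integer
first_as_bytes = by[0:1]   # slicing returns a bytes object

# "Change" the first byte to 102 (the letter f) by building a new object
# from a one-byte bytes object plus the rest of the original.
changed = bytes([102]) + by[1:]

print(first_as_int, first_as_bytes, changed, by)
```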
>>> by = b'abcd\x65'
>>> barr = bytearray(by)
①
>>> barr
bytearray(b'abcde')
>>> len(barr)
②
5
>>> barr[0] = 102
③
>>> barr
bytearray(b'fbcde')
1. To convert a bytes object into a mutable bytearray object, use the built-in bytearray() function.
2. All the methods and operations you can do on a bytes object, you can do on a bytearray object too.
3. The one difference is that, with the bytearray object, you can assign individual bytes using index notation.
The assigned value must be an integer from 0 to 255.
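The 0 to 255 limit in note 3 is enforced at assignment time. The following sketch shows what happens with an out-of-range value, and how to get back to an immutable bytes object afterwards; the names out_of_range_error and frozen are invented for the example.

```python
barr = bytearray(b'abcde')
barr[0] = 102                 # fine: 102 fits in a single byte

out_of_range_error = None
try:
    barr[1] = 256             # too large for a single byte
except ValueError as e:
    out_of_range_error = e    # keep a reference; e is cleared after the block

# bytes() converts the mutable bytearray back into an immutable bytes object.
frozen = bytes(barr)
```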
The one thing you can never do is mix bytes and strings.
>>> by = b'd'
>>> s = 'abcde'
>>> by + s
①
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
>>> s.count(by)
②
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
>>> s.count(by.decode('ascii'))
③
1
1. You can’t concatenate bytes and strings. They are two different data types.
2. You can’t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a
sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after
decoding this sequence of bytes in a particular character encoding”? Well then, you’ll need to say that
explicitly. Python 3 won’t implicitly convert bytes to strings or strings to bytes.
3. By an amazing coincidence, this line of code says “count the occurrences of the string that you would get
after decoding this sequence of bytes in this particular character encoding.”
And here is the link between strings and bytes: bytes objects have a decode() method that takes a
character encoding and returns a string, and strings have an encode() method that takes a character
encoding and returns a bytes object. In the previous example, the decoding was relatively
straightforward — converting a sequence of bytes in the ASCII encoding into a string of characters. But the
same process works with any encoding that supports the characters of the string — even legacy (non-
Unicode) encodings.
>>> a_string = '深入 Python'
①
>>> len(a_string)
9
>>> by = a_string.encode('utf-8')
②
>>> by
b'\xe6\xb7\xb1\xe5\x85\xa5 Python'
>>> len(by)
13
>>> by = a_string.encode('gb18030')
③
>>> by
b'\xc9\xee\xc8\xeb Python'
>>> len(by)
11
>>> by = a_string.encode('big5')
④
>>> by
b'\xb2`\xa4J Python'
>>> len(by)
11
>>> roundtrip = by.decode('big5')
⑤
>>> roundtrip
'深入 Python'
>>> a_string == roundtrip
True
1. This is a string. It has nine characters.
2. This is a bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and
encode it in UTF-8.
3. This is a bytes object. It has 11 bytes. It is the sequence of bytes you get when you take a_string and
encode it in GB18030.
4. This is a bytes object. It has 11 bytes. It is an entirely different sequence of bytes that you get when you take
a_string and encode it in Big5.
5. This is a string. It has nine characters. It is the sequence of characters you get when you take by and decode
it using the Big5 encoding algorithm. It is identical to the original string.
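The round trip only works when the encoding supports every character in the string. Here is a small sketch of what happens when it does not. The errors argument is a standard parameter of the encode() method; the variable names are illustrative only.

```python
a_string = '深入 Python'

# ASCII has no representation for the two Chinese characters,
# so a strict encode (the default) raises UnicodeEncodeError.
try:
    a_string.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# errors='replace' substitutes '?' for each character that can't be encoded.
# This loses information, so the round trip no longer matches the original.
lossy = a_string.encode('ascii', errors='replace')
roundtrip = lossy.decode('ascii')
```

Silently replacing characters is rarely what you want for real data; the sketch is only meant to show why the choice of encoding matters.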
⁂
4.7. POSTSCRIPT: CHARACTER ENCODING OF PYTHON SOURCE CODE
Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8.
☞ In Python 2, the default encoding for .py files was ASCII. In Python 3, the default encoding is UTF-8.
If you would like to use a different encoding within your Python code, you can put an encoding declaration
on the first line of each file. This declaration defines a .py file to be windows-1252:
# -*- coding: windows-1252 -*-
Technically, the character encoding override can also be on the second line, if the first line is a UNIX-like
hash-bang command.
#!/usr/bin/python3
# -*- coding: windows-1252 -*-
For more information, consult PEP 263: Defining Python Source Code Encodings.
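One way to see which encoding Python would pick for a given source file is the standard library's tokenize.detect_encoding() function, which looks at the first two lines for a declaration. This is a minimal sketch that uses in-memory bytes rather than a real file; the source text is invented for the example.

```python
import codecs
import io
import tokenize

source = (b'#!/usr/bin/python3\n'
          b'# -*- coding: windows-1252 -*-\n'
          b'print("hello")\n')

# detect_encoding() takes a readline callable that returns bytes.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)

# With no declaration at all, the default is UTF-8.
default, _ = tokenize.detect_encoding(io.BytesIO(b'print("hello")\n').readline)

# codecs.lookup() resolves aliases, so compare canonical codec names.
declared_codec = codecs.lookup(declared).name
```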
⁂
4.8. FURTHER READING
On Unicode in Python:
• What’s New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit
• PEP 261 explains how Python handles astral characters outside of the Basic Multilingual Plane (i.e. characters whose ordinal value is greater than 65535)
On Unicode in general:
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and
On character encoding in other formats:
On strings and string formatting:
• string — Common string operations
• Format Specification Mini-Language
• PEP 3101: Advanced String Formatting
CHAPTER 5. REGULAR EXPRESSIONS
❝ Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two
problems. ❞
5.1. DIVING IN
Getting a small bit of text out of a large block of text is a challenge. In Python, strings have methods
for searching and replacing: index(), find(), split(), count(), replace(), &c. But these methods are
limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring,
and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call
s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The
replace() and split() methods have the same limitations.
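To make the case-sensitivity limitation concrete, here is a small sketch using find(). The mixed-case address is made up for illustration.

```python
s = '100 North Main Road'

# find() is case-sensitive, so the uppercase search string is not found.
case_sensitive = s.find('ROAD')        # -1 means "not found"

# Normalize both the string and the search term to the same case first.
case_insensitive = s.upper().find('ROAD'.upper())
```

The second call works, but only because both sides were normalized by hand, and the result is an index into the uppercased copy rather than a replacement in the original.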
If your goal can be accomplished with string methods, you should use them. They’re fast and simple and easy
to read, and there’s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of
different string functions with if statements to handle special cases, or if you’re chaining calls to split()
and join() to slice-and-dice your strings, you may need to move up to regular expressions.
Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text
with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code,
the result can end up being more readable than a hand-rolled solution that uses a long chain of string
functions. There are even ways of embedding comments within regular expressions, so you can include fine-
grained documentation within them.
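As a brief illustration of embedded comments, the re module accepts a re.VERBOSE flag. The pattern below is an invented example (matching a house number at the start of an address), not one taken from the case studies that follow.

```python
import re

# re.VERBOSE tells the re module to ignore whitespace in the pattern and to
# treat everything from # to the end of a line as a comment.
pattern = re.compile(r'''
    ^       # beginning of the string
    \d+     # one or more digits (for example, a house number)
    \s      # a single whitespace character
    ''', re.VERBOSE)

match = pattern.search('100 NORTH MAIN ROAD')
no_match = pattern.search('NORTH MAIN ROAD')
```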
☞ If you’ve used regular expressions in other languages (like Perl, JavaScript, or PHP),
Python’s syntax will be very familiar. Read the summary of the re module to get an overview of the available functions and their arguments.
⁂
5.2. CASE STUDY: STREET ADDRESSES
This series of examples was inspired by a real-life problem I had in my day job several years ago, when I
needed to scrub and standardize street addresses exported from a legacy system before importing them into
a newer system. (See, I don’t just make this stuff up; it’s actually useful.) This example shows how I
approached the problem.
>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD', 'RD.')
①
'100 NORTH MAIN RD.'
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD', 'RD.')
②
'100 NORTH BRD. RD.'
>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')
③
'100 NORTH BROAD RD.'
>>> import re
④
>>> re.sub('ROAD$', 'RD.', s)
⑤
'100 NORTH BROAD RD.'
1. My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I
thought this was simple enough that I could just use the string method replace(). After all, all the data was
already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a
constant. And in this deceptively simple example, s.replace() does indeed work.
2. Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that
'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word.
The replace() method sees these two occurrences and blindly replaces both of them; meanwhile, I see my
addresses getting destroyed.
3. To solve the problem of addresses with more than one 'ROAD' substring, you could resort to something like
this: only search and replace 'ROAD' in the last four characters of the address (s[-4:]), and leave the string
alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent
on the length of the string you’re replacing. (If you were replacing 'STREET' with 'ST.', you would need to
use s[:-6] and s[-6:].replace(...).) Would you like to come back in six months and debug this? I know
I wouldn’t.
4. It’s time to move up to regular expressions. In Python, all functionality related to regular expressions is
contained in the re module.
5. Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only
when it occurs at the end of a string. The $ means “end of the string.” (There is a corresponding character,
the caret ^, which means “beginning of the string.”) Using the re.sub() function, you search the string s for
the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s,
but does not match the ROAD that’s part of the word BROAD, because that’s in the middle of s.
Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD'
at the end of the address, was not good enough, because not all addresses included a street designation
at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the
street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of
the word 'BROAD', which is not what I wanted.
>>> s = '100 BROAD'
>>> re.sub('ROAD$', 'RD.', s)
'100 BRD.'
>>> re.sub('\\bROAD$', 'RD.', s)
①
'100 BROAD'
>>> re.sub(r'\bROAD$', 'RD.', s)
②
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$', 'RD.', s)
③
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b', 'RD.', s)
④
'100 BROAD RD. APT. 3'
1. What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own word
(and not a part of some larger word). To express this in a regular expression, you use \b, which means “a
word boundary must occur right here.” In Python, this is complicated by the fact that the '\' character in a
string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why
regular expressions are easier in Perl than in Python. On the downside, Perl mixes regular expressions with
other syntax, so if you have a bug, it may be hard to tell whether it’s a bug in syntax or a bug in your
regular expression.
2. To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the
letter r. This tells Python not to treat backslashes in this string as escape characters; '\t' is a tab character, but r'\t' is
really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing
with regular expressions; otherwise, things get too confusing too quickly (and regular expressions are
confusing enough already).
3. *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address
contained the word 'ROAD' as a whole word by itself, but it wasn’t at the end, because the address had an
apartment number after the street designation. Because 'ROAD' isn’t at the very end of the string, it doesn’t
match, so the entire call to re.sub() ends up replacing nothing at all, and you get the original string back,
which is not what you want.
4. To solve this problem, I removed the $ character and added another \b. Now the regular expression reads
“match 'ROAD' when it’s a whole word by itself anywhere in the string,” whether at the end, the beginning,
or somewhere in the middle.
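Putting the notes above together, here is one possible sketch of a reusable helper. The function name standardize and the ABBREVIATIONS table are hypothetical, introduced only for this illustration; they are not part of the re module or of the original cleanup script.

```python
import re

# In an ordinary string the backslash must be doubled; a raw string avoids that.
assert '\\bROAD\\b' == r'\bROAD\b'

# Hypothetical lookup table; extend it with whatever designations your data uses.
ABBREVIATIONS = {'ROAD': 'RD.', 'STREET': 'ST.'}

def standardize(address):
    """Abbreviate each street designation that appears as a whole word."""
    for word, abbreviation in ABBREVIATIONS.items():
        # re.escape() guards against table entries containing special characters.
        address = re.sub(r'\b' + re.escape(word) + r'\b', abbreviation, address)
    return address

print(standardize('100 BROAD ROAD APT. 3'))
print(standardize('100 BROAD'))
```

Because each pattern is wrapped in \b on both sides, a street name like 'BROAD' is left alone while a standalone 'ROAD' is abbreviated wherever it appears.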
⁂
5.3. CASE STUDY: ROMAN NUMERALS
You’ve most likely seen Roman numerals, even if you didn’t recognize them. You may have seen them in
copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the
dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You
may also have seen them in outlines and bibliographical references. It’s a system of representing numbers
that really does date back to the ancient Roman empire (hence the name).
In Roman numerals, there are seven characters that are repeated and combined in various ways to represent
numbers.
• I = 1
• V = 5
• X = 10
• L = 50
• C = 100
• D = 500
• M = 1000
The following are some general rules for constructing Roman numerals:
• Sometimes characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and
VIII is 8.
• The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the
next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”).
40 is written as XL (“10 less than 50”), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (“10 less
than 50, then 1 less than 5”).
• Sometimes characters are… the opposite of additive. By putting certain characters before others, you
subtract from the final value. For example, at 9, you need to subtract from the next highest tens character: 8
is VIII, but 9 is IX (“1 less than 10”), not VIIII (since the I character can not be repeated four times). 90
is XC, 900 is CM.
• The fives characters can not be repeated. 10 is always represented as X, never as VV. 100 is always C, never
LL.
• Roman numerals are read left to right, so the order of characters matters very much. DC is 600; CD is a
completely different number (400, “100 less than 500”). CI is 101; IC is not even a valid Roman numeral
(because you can't subtract 1 directly from 100; you would need to write it as XCIX, “10 less than 100, then
1 less than 10”).
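The rules above can be sketched as a small conversion routine. This is an illustrative sketch under stated assumptions, not a method taken from this section: the name to_roman and the lookup table are invented here, and the function assumes an integer from 1 to 3999 without checking it.

```python
# Each subtractive pair (CM, CD, XC, XL, IX, IV) is listed as its own entry,
# so the loop never has to repeat a character four times.
ROMAN_NUMERAL_TABLE = (
    ('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
    ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
    ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1),
)

def to_roman(n):
    """Convert an integer (assumed to be 1..3999) to a Roman numeral string."""
    result = ''
    # Work from the largest value down, as Roman numerals are written
    # highest to lowest, left to right.
    for numeral, value in ROMAN_NUMERAL_TABLE:
        while n >= value:
            result += numeral
            n -= value
    return result

print(to_roman(1946))
print(to_roman(44))
```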
5.3.1. CHECKING FOR THOUSANDS
What would it take to validate that an arbitrary string is a valid Roman numeral? Let’s take it one digit at a
time. Since Roman numerals are always written highest to lowest, let’s start with the highest: the thousands
place. For numbers 1000 and higher, the thousands are represented by a series of M characters.
>>> import re
>>> pattern = '^M?M?M?$'
①
>>> re.search(pattern, 'M')
②
<_sre.SRE_Match object at 0106FB58>
>>> re.search(pattern, 'MM')
③
<_sre.SRE_Match object at 0106C290>
>>> re.search(pattern, 'MMM')
④
<_sre.SRE_Match object at 0106AA38>
>>> re.search(pattern, 'MMMM')
⑤
>>> re.search(pattern, '')
⑥
<_sre.SRE_Match object at 0106F4A8>
1. This pattern has three parts. ^ matches what follows only at the beginning of the string. If this were not
specified, the pattern would match no matter where the M characters were, which is not what you want.
You want to make sure that the M characters, if they’re there, are at the beginning of the string. M?
optionally matches a single M character. Since this is repeated three times, you’re matching anywhere from
zero to three M characters in a row. And $ matches the end of the string. When combined with the ^
character at the beginning, this means that the pattern must match the entire string, with no other
characters before or after the M characters.
2. The essence of the re module is the search() function, which takes a regular expression (pattern) and a
string ('M') to try to match against the regular expression. If a match is found, search() returns an object
which has various methods to describe the match; if no match is found, search() returns None, the Python
null value. All you care about at the moment is whether the pattern matches, which you can tell by just
looking at the return value of search(). 'M' matches this regular expression, because the first optional M
matches and the second and third optional M characters are ignored.
3. 'MM' matches because the first and second optional M characters match and the third M is ignored.
4. 'MMM' matches because all three M characters match.
5. 'MMMM' does not match. All three M characters match, but then the regular expression insists on the string
ending (because of the $ character), and the string doesn’t end yet (because of the fourth M). So search()
returns None.
6. Interestingly, an empty string also matches this regular expression, since all the M characters are optional.
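Since all you care about here is whether the pattern matches, the check can be wrapped in a function that returns a plain True or False. This is a minimal sketch; the function name matches_thousands is hypothetical.

```python
import re

pattern = '^M?M?M?$'

def matches_thousands(s):
    """Return True if s is zero to three M characters and nothing else."""
    # search() returns a match object on success and None on failure.
    return re.search(pattern, s) is not None

for candidate in ('M', 'MM', 'MMM', 'MMMM', ''):
    print(repr(candidate), matches_thousands(candidate))
```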
5.3.2. CHECKING FOR HUNDREDS
The hundreds place is more difficult than the thousands,
because there are several mutually exclusive ways it
could be expressed, depending on its value.
• 100 = C
• 200 = CC
• 300 = CCC
• 400 = CD
• 500 = D
• 600 = DC
• 700 = DCC
• 800 = DCCC
• 900 = CM
So there are four possible patterns:
• CM
• CD
• Zero to three C characters (zero if the hundreds place is 0)
• D, followed by zero to three C characters
The last two patterns can be combined:
• an optional D, followed by zero to three C characters
This example shows how to validate the hundreds place of a Roman numeral.
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
①
>>> re.search(pattern, 'MCM')
②
<_sre.SRE_Match object at 01070390>
>>> re.search(pattern, 'MD')
③
<_sre.SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC')
④
<_sre.SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC')
⑤
>>> re.search(pattern, '')
⑥