2.5.2. ASSIGNING MULTIPLE VALUES AT ONCE

Here’s a cool programming shortcut: in Python, you can use a tuple to assign multiple values at once.

>>> v = ('a', 2, True)

>>> (x, y, z) = v

>>> x

'a'

>>> y

2

>>> z

True

1. v is a tuple of three elements, and (x, y, z) is a tuple of three variables. Assigning one to the other

assigns each of the values of v to each of the variables, in order.

This has all kinds of uses. Suppose you want to assign names to a range of values. You can use the built-in

range() function with multi-variable assignment to quickly assign consecutive values.

>>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)

>>> MONDAY

0

>>> TUESDAY

1

>>> SUNDAY

6

1. The built-in range() function constructs a sequence of integers. (Technically, the range() function returns a range object, not a list or a tuple, but you’ll learn about that distinction later.) MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, and SUNDAY are the variables you’re defining. (This example came from the calendar module, a fun little module that prints calendars, like the UNIX program cal. The calendar module defines integer constants for days of the week.)

2. Now each variable has its value: MONDAY is 0, TUESDAY is 1, and so forth.

You can also use multi-variable assignment to build functions that return multiple values, simply by returning

a tuple of all the values. The caller can treat it as a single tuple, or it can assign the values to individual

variables. Many standard Python libraries do this, including the os module, which you'll learn about in the

next chapter.
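As a minimal sketch of the idea, the function below returns two values by packing them into a tuple. The name min_max() is hypothetical, made up for this illustration; it is not part of the standard library.

```python
def min_max(values):
    """Return the smallest and largest item of a non-empty sequence as a tuple."""
    return (min(values), max(values))

# The caller can keep the result as a single tuple...
result = min_max([3, 1, 4, 1, 5])

# ...or unpack it straight into individual variables.
(lowest, highest) = min_max([3, 1, 4, 1, 5])
```

Either way the function is called the same; only the left-hand side of the assignment changes.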

2.6. SETS

A set is an unordered “bag” of unique values. A single set can contain values of any immutable datatype.

Once you have two sets, you can do standard set operations like union, intersection, and set difference.

2.6.1. CREATING A SET

First things first. Creating a set is easy.

>>> a_set = {1}

>>> a_set

{1}

>>> type(a_set)

<class 'set'>

>>> a_set = {1, 2}

>>> a_set

{1, 2}

1. To create a set with one value, put the value in curly brackets ({}).

2. Sets are actually implemented as classes, but don’t worry about that for now.

3. To create a set with multiple values, separate the values with commas and wrap it all up with curly brackets.

You can also create a set out of a list.

>>> a_list = ['a', 'b', 'mpilgrim', True, False, 42]

>>> a_set = set(a_list)

>>> a_set

{'a', False, 'b', True, 'mpilgrim', 42}

>>> a_list

['a', 'b', 'mpilgrim', True, False, 42]

1. To create a set from a list, use the set() function. (Pedants who know about how sets are implemented

will point out that this is not really calling a function, but instantiating a class. I promise you will learn the

difference later in this book. For now, just know that set() acts like a function, and it returns a set.)

2. As I mentioned earlier, a single set can contain values of any immutable datatype. And, as I mentioned earlier, sets are

unordered. This set does not remember the original order of the list that was used to create it. If you were

to add items to this set, it would not remember the order in which you added them.

3. The original list is unchanged.

Don’t have any values yet? Not a problem. You can create an empty set.

>>> a_set = set()

>>> a_set

set()

>>> type(a_set)

<class 'set'>

>>> len(a_set)

0

>>> not_sure = {}

>>> type(not_sure)

<class 'dict'>

1. To create an empty set, call set() with no arguments.

2. The printed representation of an empty set looks a bit strange. Were you expecting {}, perhaps? That

would denote an empty dictionary, not an empty set. You’ll learn about dictionaries later in this chapter.

3. Despite the strange printed representation, this is a set…

4. …and this set has no members.

5. Due to historical quirks carried over from Python 2, you can not create an empty set with two curly

brackets. This actually creates an empty dictionary, not an empty set.

2.6.2. MODIFYING A SET

There are two different ways to add values to an existing set: the add() method, and the update() method.

>>> a_set = {1, 2}

>>> a_set.add(4)

>>> a_set

{1, 2, 4}

>>> len(a_set)

3

>>> a_set.add(1)

>>> a_set

{1, 2, 4}

>>> len(a_set)

3

1. The add() method takes a single argument, which can be any immutable datatype, and adds the given value to the set.

2. This set now has 3 members.

3. Sets are bags of unique values. If you try to add a value that already exists in the set, it will do nothing. It

won’t raise an error; it’s just a no-op.

4. This set still has 3 members.

>>> a_set = {1, 2, 3}

>>> a_set

{1, 2, 3}

>>> a_set.update({2, 4, 6})

>>> a_set

{1, 2, 3, 4, 6}

>>> a_set.update({3, 6, 9}, {1, 2, 3, 5, 8, 13})

>>> a_set

{1, 2, 3, 4, 5, 6, 8, 9, 13}

>>> a_set.update([10, 20, 30])

>>> a_set

{1, 2, 3, 4, 5, 6, 8, 9, 10, 13, 20, 30}

1. The update() method takes one argument, a set, and adds all its members to the original set. It’s as if you

called the add() method with each member of the set.

2. Duplicate values are ignored, since sets can not contain duplicates.

3. You can actually call the update() method with any number of arguments. When called with two sets, the

update() method adds all the members of each set to the original set (dropping duplicates).

4. The update() method can take objects of a number of different datatypes, including lists. When called with

a list, the update() method adds all the items of the list to the original set.
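To make the "as if you called add() with each member" description concrete, here is a small sketch that fills two identical sets, one with update() and one with an explicit loop. The variable names are placeholders chosen for this illustration.

```python
new_items = [2, 4, 6]

updated = {1, 2, 3}
updated.update(new_items)      # add every item in one call

looped = {1, 2, 3}
for item in new_items:         # the same thing, one add() at a time
    looped.add(item)
```

Both sets end up with the same members; the duplicate value 2 is silently ignored in each case.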

2.6.3. REMOVING ITEMS FROM A SET

There are three ways to remove individual values from a set. The first two, discard() and remove(), have

one subtle difference.

>>> a_set = {1, 3, 6, 10, 15, 21, 28, 36, 45}

>>> a_set

{1, 3, 36, 6, 10, 45, 15, 21, 28}

>>> a_set.discard(10)

>>> a_set

{1, 3, 36, 6, 45, 15, 21, 28}

>>> a_set.discard(10)

>>> a_set

{1, 3, 36, 6, 45, 15, 21, 28}

>>> a_set.remove(21)

>>> a_set

{1, 3, 36, 6, 45, 15, 28}

>>> a_set.remove(21)

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

KeyError: 21

1. The discard() method takes a single value as an argument and removes that value from the set.

2. If you call the discard() method with a value that doesn’t exist in the set, it does nothing. No error; it’s

just a no-op.

3. The remove() method also takes a single value as an argument, and it also removes that value from the set.

4. Here’s the difference: if the value doesn’t exist in the set, the remove() method raises a KeyError

exception.
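The difference matters when you are not sure whether the value is present. A short sketch, assuming you want to record whether remove() succeeded (the removed flag is just a name made up for this example):

```python
a_set = {1, 3, 6}

a_set.discard(10)        # 10 is not a member: nothing happens, no error

try:
    a_set.remove(10)     # 10 is not a member: raises KeyError
    removed = True
except KeyError:
    removed = False
```

Use discard() when a missing value is fine, and remove() when a missing value signals a bug you want to hear about.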

Like lists, sets have a pop() method.

>>> a_set = {1, 3, 6, 10, 15, 21, 28, 36, 45}

>>> a_set.pop()

1

>>> a_set.pop()

3

>>> a_set.pop()

36

>>> a_set

{6, 10, 45, 15, 21, 28}

>>> a_set.clear()

>>> a_set

set()

>>> a_set.pop()

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

KeyError: 'pop from an empty set'

1. The pop() method removes a single value from a set and returns the value. However, since sets are

unordered, there is no “last” value in a set, so there is no way to control which value gets removed. It is

completely arbitrary.

2. The clear() method removes all values from a set, leaving you with an empty set. The result is similar to a_set = set(), which would create a new empty set and overwrite the previous value of the a_set variable; the difference is that clear() empties the existing set in place rather than creating a new one.

3. Attempting to pop a value from an empty set will raise a KeyError exception.
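Calling clear() and reassigning the variable are not quite interchangeable when a second name refers to the same set. The sketch below illustrates this; alias and other are hypothetical names introduced only for the example.

```python
a_set = {1, 2, 3}
alias = a_set                      # a second name for the same set object
a_set.clear()                      # empties that object in place
alias_after_clear = len(alias)     # the alias sees the now-empty set

b_set = {1, 2, 3}
other = b_set
b_set = set()                      # rebinds b_set to a brand-new empty set
other_after_rebind = len(other)    # the old set is untouched
```

After clear(), every name that refers to the set sees it empty. After reassignment, only the reassigned name changes.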

2.6.4. COMMON SET OPERATIONS

Python’s set type supports several common set operations.

>>> a_set = {2, 4, 5, 9, 12, 21, 30, 51, 76, 127, 195}

>>> 30 in a_set

True

>>> 31 in a_set

False

>>> b_set = {1, 2, 3, 5, 6, 8, 9, 12, 15, 17, 18, 21}

>>> a_set.union(b_set)

{1, 2, 195, 4, 5, 6, 8, 12, 76, 15, 17, 18, 3, 21, 30, 51, 9, 127}

>>> a_set.intersection(b_set)

{9, 2, 12, 5, 21}

>>> a_set.difference(b_set)

{195, 4, 76, 51, 30, 127}

>>> a_set.symmetric_difference(b_set)

{1, 3, 4, 6, 8, 76, 15, 17, 18, 195, 127, 30, 51}

1. To test whether a value is a member of a set, use the in operator. This works the same as lists.

2. The union() method returns a new set containing all the elements that are in either set.

3. The intersection() method returns a new set containing all the elements that are in both sets.

4. The difference() method returns a new set containing all the elements that are in a_set but not b_set.

5. The symmetric_difference() method returns a new set containing all the elements that are in exactly one

of the sets.

Three of these methods are symmetric.

# continued from the previous example

>>> b_set.symmetric_difference(a_set)

{3, 1, 195, 4, 6, 8, 76, 15, 17, 18, 51, 30, 127}

>>> b_set.symmetric_difference(a_set) == a_set.symmetric_difference(b_set)

True

>>> b_set.union(a_set) == a_set.union(b_set)

True

>>> b_set.intersection(a_set) == a_set.intersection(b_set)

True

>>> b_set.difference(a_set) == a_set.difference(b_set)

False

1. The symmetric difference of a_set from b_set looks different than the symmetric difference of b_set from

a_set, but remember, sets are unordered. Any two sets that contain all the same values (with none left

over) are considered equal.

2. And that’s exactly what happens here. Don’t be fooled by the Python Shell’s printed representation of these

sets. They contain the same values, so they are equal.

3. The union of two sets is also symmetric.

4. The intersection of two sets is also symmetric.

5. The difference of two sets is not symmetric. That makes sense; it’s analogous to subtracting one number

from another. The order of the operands matters.
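The text above uses only the method forms. Python's set type also supports operator spellings for the same four operations; they are not covered in this section, so treat the sketch below as a side note. The sets here are shortened versions of the earlier example.

```python
a_set = {2, 4, 5, 9, 12}
b_set = {1, 2, 3, 5, 6}

union = a_set | b_set           # same result as a_set.union(b_set)
intersection = a_set & b_set    # same result as a_set.intersection(b_set)
difference = a_set - b_set      # same result as a_set.difference(b_set)
symmetric = a_set ^ b_set       # same result as a_set.symmetric_difference(b_set)
```

One practical distinction: the operators require both operands to be sets, while the methods accept other iterables such as lists.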

Finally, there are a few questions you can ask of sets.

>>> a_set = {1, 2, 3}

>>> b_set = {1, 2, 3, 4}

>>> a_set.issubset(b_set)

True

>>> b_set.issuperset(a_set)

True

>>> a_set.add(5)

>>> a_set.issubset(b_set)

False

>>> b_set.issuperset(a_set)

False

1. a_set is a subset of b_set — all the members of a_set are also members of b_set.

2. Asking the same question in reverse, b_set is a superset of a_set, because all the members of a_set are

also members of b_set.

3. As soon as you add a value to a_set that is not in b_set, both tests return False.

2.6.5. SETS IN A BOOLEAN CONTEXT

You can use sets in a boolean context, such as an if statement.

>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...

>>> is_it_true(set())

no, it's false

>>> is_it_true({'a'})

yes, it's true

>>> is_it_true({False})

yes, it's true

1. In a boolean context, an empty set is false.

2. Any set with at least one item is true.

3. Any set with at least one item is true. The value of the items is irrelevant.

2.7. DICTIONARIES

A dictionary is an unordered set of key-value pairs. When you add a key to a dictionary, you must also add

a value for that key. (You can always change the value later.) Python dictionaries are optimized for retrieving

the value when you know the key, but not the other way around.

☞ A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes

always start with a % character. In Python, variables can be named anything, and

Python keeps track of the datatype internally.

2.7.1. CREATING A DICTIONARY

Creating a dictionary is easy. The syntax is similar to sets, but instead of values, you have key-value pairs.

Once you have a dictionary, you can look up values by their key.

>>> a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

>>> a_dict

{'server': 'db.diveintopython3.org', 'database': 'mysql'}

>>> a_dict['server']

'db.diveintopython3.org'

>>> a_dict['database']

'mysql'

>>> a_dict['db.diveintopython3.org']

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

KeyError: 'db.diveintopython3.org'

1. First, you create a new dictionary with two items and assign it to the variable a_dict. Each item is a key-value pair, and the whole set of items is enclosed in curly braces.

2. 'server' is a key, and its associated value, referenced by a_dict['server'], is

'db.diveintopython3.org'.

3. 'database' is a key, and its associated value, referenced by a_dict['database'], is 'mysql'.

4. You can get values by key, but you can’t get keys by value. So a_dict['server'] is

'db.diveintopython3.org', but a_dict['db.diveintopython3.org'] raises an exception, because

'db.diveintopython3.org' is not a key.
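If you really do need to go from a value back to its keys, you have to search for it yourself. Here is a rough sketch; keys_for_value() is a hypothetical helper written for this illustration, not something a dictionary provides.

```python
a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

def keys_for_value(dictionary, wanted):
    """Return a list of every key whose value equals wanted."""
    found = []
    for key in dictionary:              # looping over a dictionary yields its keys
        if dictionary[key] == wanted:
            found.append(key)
    return found

matches = keys_for_value(a_dict, 'mysql')
no_matches = keys_for_value(a_dict, 'postgres')
```

The helper returns a list because several keys can share one value, and it has to examine every item, which is why dictionaries are described as optimized for lookups by key only.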

2.7.2. MODIFYING A DICTIONARY

Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any

time, or you can modify the value of an existing key. Continuing from the previous example:

>>> a_dict

{'server': 'db.diveintopython3.org', 'database': 'mysql'}

>>> a_dict['database'] = 'blog'

>>> a_dict

{'server': 'db.diveintopython3.org', 'database': 'blog'}

>>> a_dict['user'] = 'mark'

>>> a_dict

{'server': 'db.diveintopython3.org', 'user': 'mark', 'database': 'blog'}

>>> a_dict['user'] = 'dora'

>>> a_dict

{'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}

>>> a_dict['User'] = 'mark'

>>> a_dict

{'User': 'mark', 'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}

1. You can not have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old

value.

2. You can add new key-value pairs at any time. This syntax is identical to modifying existing values.

3. The new dictionary item (key 'user', value 'mark') appears to be in the middle. In fact, it was just a

coincidence that the items appeared to be in order in the first example; it is just as much a coincidence that

they appear to be out of order now.

4. Assigning a value to an existing dictionary key simply replaces the old value with the new one.

5. Will this change the value of the user key back to "mark"? No! Look at the key closely — that’s a capital U

in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not

overwriting an existing one. It may look similar to you, but as far as Python is concerned, it’s completely

different.

2.7.3. MIXED-VALUE DICTIONARIES

Dictionaries aren’t just for strings. Dictionary values can be any datatype, including integers, booleans,

arbitrary objects, or even other dictionaries. And within a single dictionary, the values don’t all need to be

the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be

strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.

In fact, you’ve already seen a dictionary with non-string keys and values, in your first Python program.

SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

Let's tear that apart in the interactive shell.

>>> SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
...             1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

>>> len(SUFFIXES)

2

>>> 1000 in SUFFIXES

True

>>> SUFFIXES[1000]

['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

>>> SUFFIXES[1024]

['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']

>>> SUFFIXES[1000][3]

'TB'

1. Like lists and sets, the len() function gives you the number of keys in a dictionary.

2. And like lists and sets, you can use the in operator to test whether a specific key is defined in a dictionary.

3. 1000 is a key in the SUFFIXES dictionary; its value is a list of eight items (eight strings, to be precise).

4. Similarly, 1024 is a key in the SUFFIXES dictionary; its value is also a list of eight items.

5. Since SUFFIXES[1000] is a list, you can address individual items in the list by their 0-based index.

2.7.4. DICTIONARIES IN A BOOLEAN CONTEXT

You can also use a dictionary in a boolean context, such as an if statement.

>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...
>>> is_it_true({})
no, it's false
>>> is_it_true({'a': 1})
yes, it's true

1. In a boolean context, an empty dictionary is false.

2. Any dictionary with at least one key-value pair is true.

2.8. None

None is a special constant in Python. It is a null value. None is not the same as False. None is not 0. None is

not an empty string. Comparing None to anything other than None will always return False.

None is the only null value. It has its own datatype (NoneType). You can assign None to any variable, but you

can not create other NoneType objects. All variables whose value is None are equal to each other.

>>> type(None)

<class 'NoneType'>

>>> None == False

False

>>> None == 0

False

>>> None == ''

False

>>> None == None

True

>>> x = None

>>> x == None

True

>>> y = None

>>> x == y

True

2.8.1. None IN A BOOLEAN CONTEXT

In a boolean context, None is false and not None is true.

>>> def is_it_true(anything):
...     if anything:
...         print("yes, it's true")
...     else:
...         print("no, it's false")
...

>>> is_it_true(None)

no, it's false

>>> is_it_true(not None)

yes, it's true

2.9. FURTHER READING

Boolean operations

Numeric types

Sequence types

Set types

Mapping types

fractions module

math module

PEP 237: Unifying Long Integers and Integers

PEP 238: Changing the Division Operator

CHAPTER 3. COMPREHENSIONS

Our imagination is stretched to the utmost, not, as in fiction, to imagine things which are not really there, but just

to comprehend those things which are.

Richard Feynman

3.1. DIVING IN

Every programming language has that one feature, a complicated thing intentionally made simple. If

you’re coming from another language, you could easily miss it, because your old language didn’t make that

thing simple (because it was busy making something else simple instead). This chapter will teach you about

list comprehensions, dictionary comprehensions, and set comprehensions: three related concepts centered

around one very powerful technique. But first, I want to take a little detour into two modules that will help

you navigate your local file system.

3.2. WORKING WITH FILES AND DIRECTORIES

Python 3 comes with a module called os, which stands for “operating system.” The os module contains a plethora of functions to get information on — and in some cases, to manipulate — local directories, files,

processes, and environment variables. Python does its best to offer a unified API across all supported

operating systems so your programs can run on any computer with as little platform-specific code as

possible.

3.2.1. THE CURRENT WORKING DIRECTORY

When you’re just getting started with Python, you’re going to spend a lot of time in the Python Shell.

Throughout this book, you will see examples that go like this:

1. Import one of the modules in the examples folder

2. Call a function in that module

3. Explain the result

If you don’t know about the current working directory, step 1 will probably fail with an ImportError. Why? Because Python will look for the example module in the import search path, but it won’t find it because the examples folder isn’t one of the directories in the search path. To get past this, you can do one of two things:

1. Add the examples folder to the import search path

2. Change the current working directory to the examples folder

The current working directory is an invisible property that Python holds in memory at all times. There is always a current working directory, whether you’re in the Python Shell, running your own Python script from the command line, or running a Python CGI script on a web server somewhere.

The os module contains two functions to deal with the current working directory.

>>> import os

>>> print(os.getcwd())

C:\Python31

>>> os.chdir('/Users/pilgrim/diveintopython3/examples')

>>> print(os.getcwd())

C:\Users\pilgrim\diveintopython3\examples

1. The os module comes with Python; you can import it anytime, anywhere.

2. Use the os.getcwd() function to get the current working directory. When you run the graphical Python

Shell, the current working directory starts as the directory where the Python Shell executable is. On

Windows, this depends on where you installed Python; the default directory is c:\Python31. If you run the

Python Shell from the command line, the current working directory starts as the directory you were in

when you ran python3.

3. Use the os.chdir() function to change the current working directory.

4. When I called the os.chdir() function, I used a Linux-style pathname (forward slashes, no drive letter) even

though I’m on Windows. This is one of the places where Python tries to paper over the differences between

operating systems.
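If you want to try these two functions without depending on any particular folder on your machine, here is a small sketch that changes into a temporary directory and back again. The tempfile module is an assumption of this example; it is part of the standard library but is not used elsewhere in this chapter.

```python
import os
import tempfile

original = os.getcwd()               # remember where we started

with tempfile.TemporaryDirectory() as scratch:
    os.chdir(scratch)                # change the current working directory
    inside = os.getcwd()             # now reports the temporary directory
    os.chdir(original)               # change back before the directory is deleted

restored = os.getcwd()
```

Saving the original directory and restoring it afterwards is a useful habit, because the current working directory is shared by the whole running program.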

3.2.2. WORKING WITH FILENAMES AND DIRECTORY NAMES

While we’re on the subject of directories, I want to point out the os.path module. os.path contains

functions for manipulating filenames and directory names.

>>> import os

>>> print(os.path.join('/Users/pilgrim/diveintopython3/examples/', 'humansize.py'))

/Users/pilgrim/diveintopython3/examples/humansize.py

>>> print(os.path.join('/Users/pilgrim/diveintopython3/examples', 'humansize.py'))

/Users/pilgrim/diveintopython3/examples\humansize.py

>>> print(os.path.expanduser('~'))

c:\Users\pilgrim

>>> print(os.path.join(os.path.expanduser('~'), 'diveintopython3', 'examples', 'humansize.py'))

c:\Users\pilgrim\diveintopython3\examples\humansize.py

1. The os.path.join() function constructs a pathname out of one or more partial pathnames. In this case, it

simply concatenates strings.

2. In this slightly less trivial case, calling the os.path.join() function will add an extra slash to the pathname

before joining it to the filename. It’s a backslash instead of a forward slash, because I constructed this

example on Windows. If you replicate this example on Linux or Mac OS X, you’ll see a forward slash

instead. Don’t fuss with slashes; always use os.path.join() and let Python do the right thing.

3. The os.path.expanduser() function will expand a pathname that uses ~ to represent the current user’s

home directory. This works on any platform where users have a home directory, including Linux, Mac OS X,

and Windows. The returned path does not have a trailing slash, but the os.path.join() function doesn’t

mind.

4. Combining these techniques, you can easily construct pathnames for directories and files in the user’s home

directory. The os.path.join() function can take any number of arguments. I was overjoyed when I

discovered this, since addSlashIfNecessary() is one of the stupid little functions I always need to write

when building up my toolbox in a new language. Do not write this stupid little function in Python; smart

people have already taken care of it for you.
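The points above can be checked with a platform-neutral sketch. It uses os.sep, the path separator for whatever operating system you are running on, so the same code behaves consistently on Windows, Linux, and Mac OS X. The folder names are the ones from the example and do not need to exist.

```python
import os

# A trailing separator on the first part makes no difference to the result.
with_slash = os.path.join('examples' + os.sep, 'humansize.py')
without_slash = os.path.join('examples', 'humansize.py')

# Any number of parts can be joined in a single call.
home = os.path.expanduser('~')
pathname = os.path.join(home, 'diveintopython3', 'examples', 'humansize.py')
```

Note that os.path.join() only builds the string; it does not check whether the file or any of the directories actually exist.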

os.path also contains functions to split full pathnames, directory names, and filenames into their constituent

parts.

>>> pathname = '/Users/pilgrim/diveintopython3/examples/humansize.py'

>>> os.path.split(pathname)

('/Users/pilgrim/diveintopython3/examples', 'humansize.py')

>>> (dirname, filename) = os.path.split(pathname)

>>> dirname

'/Users/pilgrim/diveintopython3/examples'

>>> filename

'humansize.py'

>>> (shortname, extension) = os.path.splitext(filename)

>>> shortname

'humansize'

>>> extension

'.py'

1. The split function splits a full pathname and returns a tuple containing the path and filename.

2. Remember when I said you could use multi-variable assignment to return multiple values from a function?

The os.path.split() function does exactly that. You assign the return value of the split function into a

tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple.

3. The first variable, dirname, receives the value of the first element of the tuple returned from the

os.path.split() function, the file path.

4. The second variable, filename, receives the value of the second element of the tuple returned from the

os.path.split() function, the filename.

5. os.path also contains the os.path.splitext() function, which splits a filename and returns a tuple

containing the filename and the file extension. You use the same technique to assign each of them to

separate variables.

3.2.3. LISTING DIRECTORIES

The glob module is another tool in the Python standard library. It’s an easy way to get the contents of a

directory programmatically, and it uses the sort of wildcards that you may already be familiar with from

working on the command line.

>>> os.chdir('/Users/pilgrim/diveintopython3/')

>>> import glob

>>> glob.glob('examples/*.xml')

['examples\\feed-broken.xml',

'examples\\feed-ns0.xml',

'examples\\feed.xml']

>>> os.chdir('examples/')

>>> glob.glob('*test*.py')

['alphameticstest.py',

'pluraltest1.py',

'pluraltest2.py',

'pluraltest3.py',

'pluraltest4.py',

'pluraltest5.py',

'pluraltest6.py',

'romantest1.py',

'romantest10.py',

'romantest2.py',

'romantest3.py',

'romantest4.py',

'romantest5.py',

'romantest6.py',

'romantest7.py',

'romantest8.py',

'romantest9.py']

1. The glob module takes a wildcard and returns the path of all files and directories matching the wildcard. In

this example, the wildcard is a directory path plus “*.xml”, which will match all .xml files in the examples

subdirectory.

2. Now change the current working directory to the examples subdirectory. The os.chdir() function can

take relative pathnames.

3. You can include multiple wildcards in your glob pattern. This example finds all the files in the current

working directory that end in a .py extension and contain the word test anywhere in their filename.
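You can experiment with wildcards without the examples folder by creating a few empty files in a temporary directory. This is only a sketch: the filenames are borrowed from the listing above, and tempfile and glob.escape() are additions of this example (glob.escape() protects any special characters that happen to be in the temporary directory's own name).

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # Create a few empty files to search through.
    for name in ('feed.xml', 'romantest1.py', 'pluraltest1.py', 'humansize.py'):
        open(os.path.join(scratch, name), 'w').close()

    # Build the pattern with os.path.join() instead of changing directories.
    pattern = os.path.join(glob.escape(scratch), '*test*.py')
    matches = glob.glob(pattern)

    # glob.glob() makes no promise about order, so sort before comparing.
    found = sorted(os.path.basename(path) for path in matches)
```

Only the two files that end in .py and contain test in their name are matched; feed.xml and humansize.py are left out.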

3.2.4. GETTING FILE METADATA

Every modern file system stores metadata about each file: creation date, last-modified date, file size, and so

on. Python provides a single API to access this metadata. You don’t need to open the file; all you need is

the filename.

>>> import os

>>> print(os.getcwd())

c:\Users\pilgrim\diveintopython3\examples

>>> metadata = os.stat('feed.xml')

>>> metadata.st_mtime

1247520344.9537716

>>> import time

>>> time.localtime(metadata.st_mtime)

time.struct_time(tm_year=2009, tm_mon=7, tm_mday=13, tm_hour=17,

tm_min=25, tm_sec=44, tm_wday=0, tm_yday=194, tm_isdst=1)

1. The current working directory is the examples folder.

2. feed.xml is a file in the examples folder. Calling the os.stat() function returns an object that contains

several different types of metadata about the file.

3. st_mtime is the modification time, but it’s in a format that isn’t terribly useful. (Technically, it’s the number

of seconds since the Epoch, which is defined as the first second of January 1st, 1970. Seriously.)

4. The time module is part of the Python standard library. It contains functions to convert between different

time representations, format time values into strings, and fiddle with timezones.

5. The time.localtime() function converts a time value from seconds-since-the-Epoch (from the st_mtime

property returned from the os.stat() function) into a more useful structure of year, month, day, hour,

minute, second, and so on. This file was last modified on July 13, 2009, at around 5:25 PM.

# continued from the previous example

>>> metadata.st_size

3070

>>> import humansize

>>> humansize.approximate_size(metadata.st_size)

'3.0 KiB'

1. The os.stat() function also returns the size of a file, in the st_size property. The file feed.xml is 3070

bytes.

2. You can pass the st_size property to the approximate_size() function.
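Putting the pieces of this section together, the sketch below writes a tiny file, reads its metadata, and formats the modification time as a string. The file contents and the format string are assumptions made for the illustration; time.strftime() is one of the string-formatting functions in the time module mentioned above.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'feed.xml')
    with open(path, 'w') as f:
        f.write('<feed/>')                           # seven characters

    metadata = os.stat(path)
    size = metadata.st_size                          # size in bytes
    modified = time.localtime(metadata.st_mtime)     # struct_time in local time
    stamp = time.strftime('%Y-%m-%d %H:%M', modified)
```

The stamp string has the form year-month-day hour:minute, which is usually easier to read than either the raw seconds or the struct_time.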

3.2.5. CONSTRUCTING ABSOLUTE PATHNAMES

In the previous section, the glob.glob() function returned a list of relative pathnames. The first example had pathnames like 'examples\feed.xml', and the second example had even shorter relative pathnames like

'romantest1.py'. As long as you stay in the same current working directory, these relative pathnames will

work for opening files or getting file metadata. But if you want to construct an absolute pathname — i.e. one

that includes all the directory names back to the root directory or drive letter — then you’ll need the

os.path.realpath() function.

>>> import os

>>> print(os.getcwd())

c:\Users\pilgrim\diveintopython3\examples

>>> print(os.path.realpath('feed.xml'))

c:\Users\pilgrim\diveintopython3\examples\feed.xml

3.3. LIST COMPREHENSIONS

A list comprehension provides a compact way of

mapping a list into another list by applying a function to

each of the elements of the list.

>>> a_list = [1, 9, 8, 4]

>>> [elem * 2 for elem in a_list]

[2, 18, 16, 8]

>>> a_list

[1, 9, 8, 4]

>>> a_list = [elem * 2 for elem in a_list]

>>> a_list

[2, 18, 16, 8]

1. To make sense of this, look at it from right to left. a_list is the list you’re mapping. The Python

interpreter loops through a_list one element at a time, temporarily assigning the value of each element to

the variable elem. Python then applies the function elem * 2 and appends that result to the returned list.

2. A list comprehension creates a new list; it does not change the original list.

3. It is safe to assign the result of a list comprehension to the variable that you’re mapping. Python constructs

the new list in memory, and when the list comprehension is complete, it assigns the result to the original

variable.
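If the right-to-left reading still feels odd, it may help to see the same work written as an ordinary loop. This sketch builds the doubled list both ways; doubled_loop is a name invented for the comparison.

```python
a_list = [1, 9, 8, 4]

# The list comprehension...
doubled = [elem * 2 for elem in a_list]

# ...builds the same list as this explicit loop.
doubled_loop = []
for elem in a_list:
    doubled_loop.append(elem * 2)
```

Both versions leave a_list untouched and produce a new list.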

You can use any Python expression in a list comprehension, including the functions in the os module for

manipulating files and directories.

>>> import os, glob

>>> glob.glob('*.xml')

['feed-broken.xml', 'feed-ns0.xml', 'feed.xml']

>>> [os.path.realpath(f) for f in glob.glob('*.xml')]

['c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-broken.xml',

'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-ns0.xml',

'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed.xml']

1. This returns a list of all the .xml files in the current working directory.

2. This list comprehension takes that list of .xml files and transforms it into a list of full pathnames.

List comprehensions can also filter items, producing a result that can be smaller than the original list.

>>> import os, glob

>>> [f for f in glob.glob('*.py') if os.stat(f).st_size > 6000]

['pluraltest6.py',

'romantest10.py',

'romantest6.py',

'romantest7.py',

'romantest8.py',

'romantest9.py']

1. To filter a list, you can include an if clause at the end of the list comprehension. The expression after the

if keyword will be evaluated for each item in the list. If the expression evaluates to True, the item will be

included in the output. This list comprehension looks at the list of all .py files in the current directory, and

the if expression filters that list by testing whether the size of each file is greater than 6000 bytes. There

are six such files, so the list comprehension returns a list of six filenames.
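The same filtering pattern works on any list, not just filenames. Here is a sketch that runs anywhere because it filters a hypothetical list of strings by length instead of files by size, alongside the loop it is shorthand for.

```python
words = ['feed', 'romantest', 'humansize', 'plural']

# Keep only the words longer than six characters.
long_words = [w for w in words if len(w) > 6]

# The equivalent explicit loop.
long_words_loop = []
for w in words:
    if len(w) > 6:
        long_words_loop.append(w)
```

Items for which the if expression is false are simply skipped, so the result keeps the original order but can be shorter than the input.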

All the examples of list comprehensions so far have featured simple expressions — multiply a number by a

constant, call a single function, or simply return the original list item (after filtering). But there’s no limit to

how complex a list comprehension can be.

>>> import os, glob

>>> [(os.stat(f).st_size, os.path.realpath(f)) for f in glob.glob('*.xml')]

[(3074, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-broken.xml'),

(3386, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed-ns0.xml'),

(3070, 'c:\\Users\\pilgrim\\diveintopython3\\examples\\feed.xml')]

>>> import humansize

>>> [(humansize.approximate_size(os.stat(f).st_size), f) for f in glob.glob('*.xml')]

[('3.0 KiB', 'feed-broken.xml'),

('3.3 KiB', 'feed-ns0.xml'),

('3.0 KiB', 'feed.xml')]

1. This list comprehension finds all the .xml files in the current working directory, gets the size of each file (by

calling the os.stat() function), and constructs a tuple of the file size and the absolute path of each file (by

calling the os.path.realpath() function).

2. This comprehension builds on the previous one to call the approximate_size() function with the file size of each .xml file.

3.4. DICTIONARY COMPREHENSIONS

A dictionary comprehension is like a list comprehension, but it constructs a dictionary instead of a list.

>>> import os, glob

>>> metadata = [(f, os.stat(f)) for f in glob.glob('*test*.py')]

>>> metadata[0]

('alphameticstest.py', nt.stat_result(st_mode=33206, st_ino=0, st_dev=0,

st_nlink=0, st_uid=0, st_gid=0, st_size=2509, st_atime=1247520344,

st_mtime=1247520344, st_ctime=1247520344))

>>> metadata_dict = {f:os.stat(f) for f in glob.glob('*test*.py')}

>>> type(metadata_dict)

<class 'dict'>

>>> list(metadata_dict.keys())

['romantest8.py', 'pluraltest1.py', 'pluraltest2.py', 'pluraltest5.py',

'pluraltest6.py', 'romantest7.py', 'romantest10.py', 'romantest4.py',

'romantest9.py', 'pluraltest3.py', 'romantest1.py', 'romantest2.py',

'romantest3.py', 'romantest5.py', 'romantest6.py', 'alphameticstest.py',

'pluraltest4.py']

>>> metadata_dict['alphameticstest.py'].st_size

2509

1. This is not a dictionary comprehension; it’s a list comprehension. It finds all .py files with test in their name, then constructs a tuple of the filename and the file metadata (from calling the os.stat() function).

2. Each item of the resulting list is a tuple.

3. This is a dictionary comprehension. The syntax is similar to a list comprehension, with two differences. First,

it is enclosed in curly braces instead of square brackets. Second, instead of a single expression for each item,

it contains two expressions separated by a colon. The expression before the colon (f in this example) is the

dictionary key; the expression after the colon (os.stat(f) in this example) is the value.

4. A dictionary comprehension returns a dictionary.

5. The keys of this particular dictionary are simply the filenames returned from the call to

glob.glob('*test*.py').

6. The value associated with each key is the return value from the os.stat() function. That means we can

“look up” a file by name in this dictionary to get its file metadata. One of the pieces of metadata is st_size,

the file size. The file alphameticstest.py is 2509 bytes long.

As with list comprehensions, you can include an if clause in a dictionary comprehension to filter the input

sequence based on an expression which is evaluated with each item.

>>> import os, glob, humansize

>>> metadata_dict = {f:os.stat(f) for f in glob.glob('*')}

>>> humansize_dict = {os.path.splitext(f)[0]:humansize.approximate_size(meta.st_size) \

...                   for f, meta in metadata_dict.items() if meta.st_size > 6000}

>>> list(humansize_dict.keys())

['romantest9', 'romantest8', 'romantest7', 'romantest6', 'romantest10', 'pluraltest6']

>>> humansize_dict['romantest9']

'6.5 KiB'

1. This dictionary comprehension constructs a list of all the files in the current working directory

(glob.glob('*')), gets the file metadata for each file (os.stat(f)), and constructs a dictionary whose keys

are filenames and whose values are the metadata for each file.

2. This dictionary comprehension builds on the previous comprehension, keeps only the files larger than 6000 bytes

(if meta.st_size > 6000), and uses that filtered list to construct a dictionary whose keys are the filename

minus the extension (os.path.splitext(f)[0]) and whose values are the approximate size of each file

(humansize.approximate_size(meta.st_size)).

3. As you saw in a previous example, there are six such files, thus there are six items in this dictionary.

4. The value of each key is the string returned from the approximate_size() function.
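If you don't have the humansize module or these particular files handy, the same shape of comprehension can be sketched with plain data. The file names and sizes below are invented stand-ins for what os.stat() would report.

```python
import os

# Hypothetical file sizes in bytes, standing in for os.stat(f).st_size.
file_sizes = {'romantest9.py': 6654, 'feed.xml': 3070, 'pluraltest6.py': 6100}

# Keys: the filename minus its extension.
# Values: the size in KiB, rounded to one decimal place.
# The if clause drops every entry that is not larger than 6000 bytes.
big = {os.path.splitext(name)[0]: round(size / 1024, 1)
       for name, size in file_sizes.items() if size > 6000}

print(big)
```

The key expression, the value expression, and the if clause are all evaluated once per item of the input.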

3.4.1. OTHER FUN STUFF TO DO WITH DICTIONARY COMPREHENSIONS

Here’s a trick with dictionary comprehensions that might be useful someday: swapping the keys and values of

a dictionary.

>>> a_dict = {'a': 1, 'b': 2, 'c': 3}

>>> {value:key for key, value in a_dict.items()}

{1: 'a', 2: 'b', 3: 'c'}

Of course, this only works if the values of the dictionary are hashable, which in practice means immutable values like strings, numbers, or tuples. If you try this with a dictionary that contains lists, it will fail most spectacularly.

>>> a_dict = {'a': [1, 2, 3], 'b': 4, 'c': 5}

>>> {value:key for key, value in a_dict.items()}

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "<stdin>", line 1, in <dictcomp>

TypeError: unhashable type: 'list'
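One way to avoid the traceback is to skip the values that can't be used as keys. The sketch below is only an illustration; is_hashable() is a hypothetical helper, not something from the standard library. It also shows a second catch: if two keys share a value, only one of them survives the swap.

```python
a_dict = {'a': [1, 2, 3], 'b': 4, 'c': 5, 'd': 4}

def is_hashable(value):
    """Hypothetical helper: True if value can be used as a dictionary key."""
    try:
        hash(value)
    except TypeError:
        return False
    return True

# Skip unhashable values (the list) instead of letting the comprehension fail.
swapped = {value: key for key, value in a_dict.items() if is_hashable(value)}

print(swapped)
```

Both 'b' and 'd' map to 4, so the swapped dictionary can hold only one of them under the key 4; the later item wins.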

3.5. SET COMPREHENSIONS

Not to be left out, sets have their own comprehension syntax as well. It is remarkably similar to the syntax

for dictionary comprehensions. The only difference is that sets just have values instead of key:value pairs.

>>> a_set = set(range(10))

>>> a_set

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

>>> {x ** 2 for x in a_set}

{0, 1, 4, 81, 64, 9, 16, 49, 25, 36}

>>> {x for x in a_set if x % 2 == 0}

{0, 8, 2, 4, 6}

>>> {2**x for x in range(10)}

{32, 1, 2, 4, 8, 64, 128, 256, 16, 512}

1. Set comprehensions can take a set as input. This set comprehension calculates the squares of the set of

numbers from 0 to 9.

2. Like list comprehensions and dictionary comprehensions, set comprehensions can contain an if clause to

filter each item before returning it in the result set.

3. Set comprehensions do not need to take a set as input; they can take any sequence.
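Because the result is a set, duplicates produced by the expression simply collapse. A small sketch with a made-up list of words:

```python
# A hypothetical word list with duplicates that differ only in case.
words = ['Alpha', 'beta', 'ALPHA', 'Beta', 'gamma']

# Five input items, but only three distinct lowercase words.
unique_lower = {w.lower() for w in words}

# Five input items, but only two distinct lengths.
lengths = {len(w) for w in words}

print(sorted(unique_lower), sorted(lengths))
```

This makes a set comprehension a handy way to answer "which distinct values occur?" in one line.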

3.6. FURTHER READING

os module

os — Portable access to operating system specific features

os.path module

os.path — Platform-independent manipulation of file names

glob module

glob — Filename pattern matching

time module

time — Functions for manipulating clock time

List comprehensions

Nested list comprehensions

Looping techniques

CHAPTER 4. STRINGS

I’m telling you this ’cause you’re one of my friends.

My alphabet starts where your alphabet ends!

— Dr. Seuss, On Beyond Zebra!

4.1. SOME BORING STUFF YOU NEED TO UNDERSTAND BEFORE YOU CAN DIVE IN

Few people think about it, but text is incredibly complicated. Start with the alphabet. The people of

Bougainville have the smallest alphabet in the world; their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese,

and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and

lowercase separately — plus a handful of !@#$%& punctuation marks.

When you talk about “text,” you’re probably thinking of “characters and symbols on my computer screen.”

But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve

ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking,

the character encoding provides a mapping between the stuff you see on your screen and the stuff your

computer actually stores in memory and on disk. There are many different character encodings, some

optimized for particular languages like Russian or Chinese or English, and others that can be used for

multiple languages.

In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each

encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So

you can think of the character encoding as a kind of decryption key. Whenever someone gives you a

sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character

encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key

at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and

the result will be gibberish.

Surely you’ve seen web pages with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.

There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, &c.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.

Western European languages like French, Spanish, and German have more letters than English. Or, more

precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most

common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on

Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then

extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252),

&c. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.

Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they

require multiple-byte character sets. That is, each “character” is represented by a two-byte number from

0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings,

namely that they each use the same numbers to mean different things. It’s just that the range of numbers is

broader, because there are many more characters to represent.
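To see the "same numbers, different meanings" problem concretely, here is a small sketch that decodes the single byte 241 with two different single-byte encodings. The codec names 'cp1252' and 'koi8_r' are the names Python's standard library uses for these encodings.

```python
# One byte, value 241 (hexadecimal F1).
raw = bytes([241])

# The same byte means different characters under different encodings.
as_cp1252 = raw.decode('cp1252')   # Western European
as_koi8_r = raw.decode('koi8_r')   # Russian

print(as_cp1252, as_koi8_r)
```

Nothing in the byte itself says which reading is the right one; that information has to come from somewhere else.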

That was mostly OK in a non-networked world, where “text” was something you typed yourself and

occasionally printed. There wasn’t much “plain text”. Source code was ASCII, and everyone else used word

processors, which defined their own (non-text) formats that tracked character encoding information along

with rich styling, &c. People read these documents with the same word processing program as the original

author, so everything worked, more or less.

Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the

globe, being authored on one computer, transmitted through a second computer, and received and displayed

by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no!

What to do? Well, systems had to be designed to carry encoding information along with every piece of

“plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable

characters. A missing decryption key means garbled text, gibberish, or worse.

Now think about trying to store multiple pieces of text in the same place, like in the same database table

that holds all the email you’ve ever received. You still need to store the character encoding alongside each

piece of text so you can display it properly. Think that’s hard? Try searching your email database, which

means converting between multiple encodings on the fly. Doesn’t that sound fun?

Now think about the possibility of multilingual documents, where characters from several languages are next

to each other in the same document. (Hint: programs that tried to do this typically used escape codes to

switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means Я; poof, now you’re in Mac Greek

mode, so 241 means ώ.) And of course you’ll want to search those documents, too.

Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such

thing as “plain text.”

4.2. UNICODE

Enter Unicode.

Unicode is a system designed to represent every character from every language. Unicode represents each

letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at

least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2

bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number,

unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and

exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep

track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.

On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per

document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious

question should leap out at you. Four bytes? For every single character‽ That seems awfully wasteful,

especially for languages like English and Spanish, which need less than one byte (256 numbers) to express

every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never

need more than two bytes per character.

There is a Unicode encoding that uses four bytes per character. It’s called UTF-32, because 32 bits = 4

bytes. UTF-32 is a straightforward encoding; it takes each Unicode character (a 4-byte number) and

represents the character with that same number. This has some advantages, the most important being that

you can find the Nth character of a string in constant time, because the Nth character starts at the 4×Nth

byte. It also has several disadvantages, the most obvious being that it takes four freaking bytes to store every

freaking character.

Even though there are a lot of Unicode characters, it turns out that most people will never use anything

beyond the first 65535. Thus, there is another Unicode encoding, called UTF-16 (because 16 bits = 2 bytes).

UTF-16 encodes every character from 0–65535 as two bytes, then uses some dirty hacks if you actually need

to represent the rarely-used “astral plane” Unicode characters beyond 65535. Most obvious advantage:

UTF-16 is twice as space-efficient as UTF-32, because every character requires only two bytes to store

instead of four bytes (except for the ones that don’t). And you can still easily find the Nth character of a

string in constant time, if you assume that the string doesn’t include any astral plane characters, which is a

good assumption right up until the moment that it’s not.
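The space trade-off between these two encodings is easy to check for yourself. The sketch below encodes four sample characters of my own choosing, including one "astral plane" character (U+1F600), and counts the bytes. It uses the 'utf-32-be' and 'utf-16-be' codec variants so that Python emits only the bytes of the character itself, with no extra marker bytes.

```python
# Four sample characters: ASCII, extended Latin, Chinese, and one
# "astral plane" character beyond 65535.
samples = ['A', 'ñ', '中', '\U0001F600']

# Map each character to (bytes in UTF-32, bytes in UTF-16).
sizes = {ch: (len(ch.encode('utf-32-be')), len(ch.encode('utf-16-be')))
         for ch in samples}

for ch, (utf32_len, utf16_len) in sizes.items():
    print('U+{0:04X}: UTF-32 = {1} bytes, UTF-16 = {2} bytes'.format(
        ord(ch), utf32_len, utf16_len))
```

UTF-32 always spends four bytes. UTF-16 spends two, except for the astral plane character, which costs four.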

But there are also non-obvious disadvantages to both UTF-32 and UTF-16. Different computer systems store

the individual bytes of a multi-byte number in different orders. That means that the character U+4E2D could be stored in UTF-16 as either

4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32, there are even

more possible byte orderings.) As long as your documents never leave your computer, you’re

safe — different applications on the same computer will all use the same byte order. But the minute you

want to transfer documents between systems, perhaps on a world wide web of some sort, you’re going to

need a way to indicate which order your bytes are stored in. Otherwise, the receiving system has no way of

knowing whether the two-byte sequence 4E 2D means U+4E2D or U+2D4E.

To solve this problem, the multi-byte Unicode encodings define a “Byte Order Mark,” which is a special non-

printable character that you can include at the beginning of your document to indicate what order your

bytes are in. For UTF-16, the Byte Order Mark is U+FEFF. If you receive a UTF-16 document that starts

with the bytes FF FE, you know the byte ordering is little-endian; if it starts with FE FF, you know the byte

ordering is big-endian.
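Python's codecs make the byte-order problem easy to reproduce. This is a rough sketch, not a recipe: it encodes U+4E2D both ways, deliberately misreads one of the results, and then looks at the Byte Order Mark that the plain 'utf-16' codec adds. The constants codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE come from the standard library.

```python
import codecs

ch = '\u4e2d'  # the character 中

big_endian = ch.encode('utf-16-be')
little_endian = ch.encode('utf-16-le')

# Reading big-endian bytes with the wrong byte order yields a different character.
misread = big_endian.decode('utf-16-le')

# The plain 'utf-16' codec prepends a Byte Order Mark in the machine's
# native order, so a decoder can work out the byte order on its own.
with_bom = ch.encode('utf-16')
bom = with_bom[:2]

print(big_endian.hex(), little_endian.hex())
print('misread as U+{0:04X}'.format(ord(misread)))
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

Which of the two marks you see depends on the machine running the code, which is exactly why the mark is needed.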

Still, UTF-16 isn’t exactly ideal, especially if you’re dealing with a lot of ASCII characters. If you think about

it, even a Chinese web page is going to contain a lot of ASCII characters — all the elements and attributes

surrounding the printable Chinese characters. Being able to find the Nth character in constant time is nice,

but there’s still the nagging problem of those astral plane characters, which mean that you can’t guarantee

that every character is exactly two bytes, so you can’t really find the Nth character in constant time unless

you maintain a separate index. And boy, there sure is a lot of ASCII text in the world…

Other people pondered these questions, and they came up with a solution:

UTF-8

UTF-8 is a variable-length encoding system for Unicode. That is, different characters take up a different

number of bytes. For ASCII characters (A-Z, &c.) UTF-8 uses just one byte per character. In fact, it uses

the exact same bytes; the first 128 characters (0–127) in UTF-8 are indistinguishable from ASCII. “Extended

Latin” characters like ñ and ö end up taking two bytes. (The bytes are not simply the Unicode code point

like they would be in UTF-16; there is some serious bit-twiddling involved.) Chinese characters like 中 end

up taking three bytes. The rarely-used “astral plane” characters take four bytes.

Disadvantages: because each character can take a different number of bytes, finding the Nth character is an

O(N) operation — that is, the longer the string, the longer it takes to find a specific character. Also, there is

bit-twiddling involved to encode characters into bytes and decode bytes into characters.

Advantages: super-efficient encoding of common ASCII characters. No worse than UTF-16 for extended Latin

characters. Better than UTF-32 for Chinese characters. Also (and you’ll have to trust me on this, because

I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-

ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer.
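You don't have to take the byte counts on faith. The following sketch encodes one character from each of the groups described above (the sample characters are my own picks) and checks that plain ASCII text comes out byte-for-byte identical in UTF-8.

```python
# One sample from each group: ASCII, extended Latin, Chinese, astral plane.
samples = ['A', 'ñ', '中', '\U0001F600']

# Number of bytes each character needs in UTF-8.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

# Pure ASCII text is the same sequence of bytes in ASCII and in UTF-8.
text = 'plain old ASCII'
same_as_ascii = text.encode('utf-8') == text.encode('ascii')

print(utf8_lengths, same_as_ascii)
print('ñ'.encode('utf-8'), '中'.encode('utf-8'))
```

The last line shows the bit-twiddling at work: the bytes for ñ and 中 look nothing like their code points, U+00F1 and U+4E2D.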

4.3. DIVING IN

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string

encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question.

UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a

sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a

sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters;

bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.

>>> s = '深入 Python'

>>> len(s)

9

>>> s[0]

'深'

>>> s + ' 3'

'深入 Python 3'

1. To create a string, enclose it in quotes. Python strings can be defined with either single quotes (') or double

quotes (").

2. The built-in len() function returns the length of the string, i.e. the number of characters. This is the same

function you use to find the length of a list, tuple, set, or dictionary. A string is like a tuple of characters.

3. Just like getting individual items out of a list, you can get individual characters out of a string using index

notation.

4. Just like lists, you can concatenate strings using the + operator.

4.4. FORMATTING STRINGS

Let’s take another look at humansize.py:

SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
            1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}

def approximate_size(size, a_kilobyte_is_1024_bytes=True):
    '''Convert a file size to human-readable form.

    Keyword arguments:
    size -- file size in bytes
    a_kilobyte_is_1024_bytes -- if True (default), use multiples of 1024
                                if False, use multiples of 1000

    Returns: string

    '''
    if size < 0:
        raise ValueError('number must be non-negative')

    multiple = 1024 if a_kilobyte_is_1024_bytes else 1000
    for suffix in SUFFIXES[multiple]:
        size /= multiple
        if size < multiple:
            return '{0:.1f} {1}'.format(size, suffix)

    raise ValueError('number too large')

1. 'KB', 'MB', 'GB'… those are each strings.

2. Function docstrings are strings. This docstring spans multiple lines, so it uses three-in-a-row quotes to start

and end the string.

3. These three-in-a-row quotes end the docstring.

4. There’s another string, being passed to the exception as a human-readable error message.

5. There’s a… whoa, what the heck is that?

Python 3 supports formatting values into strings. Although this can include very complicated expressions, the

most basic usage is to insert a value into a string with a single placeholder.

>>> username = 'mark'

>>> password = 'PapayaWhip'

>>> "{0}'s password is {1}".format(username, password)

"mark's password is PapayaWhip"

1. No, my password is not really PapayaWhip.

2. There’s a lot going on here. First, that’s a method call on a string literal. Strings are objects, and objects have

methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are replacement fields, which

are replaced by the arguments passed to the format() method.
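Because the numbers are indices into the argument list, a replacement field can appear more than once and in any order. A minimal sketch, with made-up names:

```python
# {0} and {1} each appear twice; the arguments are passed only once.
template = '{0} sent {1} a message; {1} replied to {0}.'

result = template.format('alice', 'bob')

print(result)
```

The string literal itself is never modified; format() returns a new string each time it is called.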

4.4.1. COMPOUND FIELD NAMES

The previous example shows the simplest case, where the replacement fields are simply integers. Integer

replacement fields are treated as positional indices into the argument list of the format() method. That

means that {0} is replaced by the first argument (username in this case), {1} is replaced by the second

argument (password), &c. You can have as many positional indices as you have arguments, and you can have

as many arguments as you want. But replacement fields are much more powerful than that.

>>> import humansize

>>> si_suffixes = humansize.SUFFIXES[1000]

>>> si_suffixes

['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

>>> '1000{0[0]} = 1{0[1]}'.format(si_suffixes)

'1000KB = 1MB'

1. Rather than calling any function in the humansize module, you’re just grabbing one of the data structures it

defines: the list of “SI” (powers-of-1000) suffixes.

2. This looks complicated, but it’s not. {0} would refer to the first argument passed to the format() method,

si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first

argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same

list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is

untouched. The final result is the string '1000KB = 1MB'.

What this example shows is that replacement fields can access items and properties of data structures using (almost) Python syntax. This is called compound field names. The following compound field names “just work”:

• Passing a list, and accessing an item of the list by index (as in the previous example)

• Passing a dictionary, and accessing a value of the dictionary by key

• Passing a module, and accessing its variables and functions by name

• Passing a class instance, and accessing its properties and methods by name

• Any combination of the above

Just to blow your mind, here’s an example that combines all of the above:

>>> import humansize

>>> import sys

>>> '1MB = 1000{0.modules[humansize].SUFFIXES[1000][0]}'.format(sys)

'1MB = 1000KB'

Here’s how it works:

• The sys module holds information about the currently running Python instance. Since you just imported it,

you can pass the sys module itself as an argument to the format() method. So the replacement field {0}

refers to the sys module.

• sys.modules is a dictionary of all the modules that have been imported in this Python instance. The keys

are the module names as strings; the values are the module objects themselves. So the replacement field

{0.modules} refers to the dictionary of imported modules.

• sys.modules['humansize'] is the humansize module which you just imported. The replacement field

{0.modules[humansize]} refers to the humansize module. Note the slight difference in syntax here. In real

Python code, the keys of the sys.modules dictionary are strings; to refer to them, you need to put quotes

around the module name ( e.g. 'humansize'). But within a replacement field, you skip the quotes around the

dictionary key name ( e.g. humansize). To quote PEP 3101: Advanced String Formatting, “The rules for parsing an item key are very simple. If it starts with a digit, then it is treated as a number, otherwise it is

used as a string.”

• sys.modules['humansize'].SUFFIXES is the dictionary defined at the top of the humansize module. The

replacement field {0.modules[humansize].SUFFIXES} refers to that dictionary.

• sys.modules['humansize'].SUFFIXES[1000] is a list of SI suffixes: ['KB', 'MB', 'GB', 'TB', 'PB',

'EB', 'ZB', 'YB']. So the replacement field {0.modules[humansize].SUFFIXES[1000]} refers to that list.

• sys.modules['humansize'].SUFFIXES[1000][0] is the first item of the list of SI suffixes: 'KB'. Therefore,

the complete replacement field {0.modules[humansize].SUFFIXES[1000][0]} is replaced by the two-

character string KB.
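The dictionary and class-instance cases can be tried without importing humansize. In this sketch the record dictionary and the point object are invented for illustration; types.SimpleNamespace is just a convenient standard-library way to get an object with attributes.

```python
from types import SimpleNamespace

# Hypothetical data: a dictionary and an object with two attributes.
record = {'name': 'feed.xml', 'size': 3070}
point = SimpleNamespace(x=3, y=4)

# Dictionary keys are written without quotes inside a replacement field.
by_key = '{0[name]} is {0[size]} bytes'.format(record)

# Attributes are reached with a dot, as in ordinary Python.
by_attr = '({0.x}, {0.y})'.format(point)

# The two can be chained: look up a key, then read an attribute of the result.
mixed = 'x = {0[owner].x}'.format({'owner': point})

print(by_key, by_attr, mixed)
```

Compound field names can look things up, but they cannot call methods or evaluate arbitrary expressions.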

4.4.2. FORMAT SPECIFIERS

But wait! There’s more! Let’s take another look at that strange line of code from humansize.py:

if size < multiple:

    return '{0:.1f} {1}'.format(size, suffix)

{1} is replaced with the second argument passed to the format() method, which is suffix. But what is

{0:.1f}? It’s two things: {0}, which you recognize, and :.1f, which you don’t. The second half (including

and after the colon) defines the format specifier, which further refines how the replaced variable should be

formatted.

☞ Format specifiers allow you to munge the replacement text in a variety of useful

ways, like the printf() function in C. You can add zero- or space-padding, align

strings, control decimal precision, and even convert numbers to hexadecimal.

Within a replacement field, a colon (:) marks the start of the format specifier. The format specifier “.1”

means “round to the nearest tenth” ( i.e. display only one digit after the decimal point). The format specifier

“f” means “fixed-point number” (as opposed to exponential notation or some other decimal representation).

Thus, given a size of 698.24 and suffix of 'GB', the formatted string would be '698.2 GB', because

698.24 gets rounded to one decimal place, then the suffix is appended after the number.

>>> '{0:.1f} {1}'.format(698.24, 'GB')

'698.2 GB'

For all the gory details on format specifiers, consult the Format Specification Mini-Language in the official Python documentation.
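Here is a short sampler of the other things mentioned above: alignment, zero-padding, and hexadecimal conversion. The labels and sample values are arbitrary.

```python
examples = {
    # .1f: fixed-point, one digit after the decimal point
    'one decimal place': '{0:.1f}'.format(698.24),
    # >8: right-align within a field 8 characters wide (padded with spaces)
    'right-aligned, width 8': '{0:>8}'.format('GB'),
    # 08.2f: fixed-point, two decimals, zero-padded to 8 characters in total
    'zero-padded, width 8': '{0:08.2f}'.format(3.14159),
    # x: lowercase hexadecimal
    'hexadecimal': '{0:x}'.format(255),
}

for label, text in examples.items():
    print('{0:<24} {1!r}'.format(label, text))
```

The print line uses a format specifier of its own: <24 left-aligns each label in a 24-character column so the results line up.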

4.5. OTHER COMMON STRING METHODS

Besides formatting, strings can do a number of other useful tricks.

>>> s = '''Finished files are the re-

... sult of years of scientif-

... ic study combined with the

... experience of years.'''

>>> s.splitlines()

['Finished files are the re-',

'sult of years of scientif-',

'ic study combined with the',

'experience of years.']

>>> print(s.lower())

finished files are the re-

sult of years of scientif-

ic study combined with the

experience of years.

>>> s.lower().count('f')

6

1. You can input multiline strings in the Python interactive shell. Once you start a multiline string with triple

quotation marks, just hit ENTER and the interactive shell will prompt you to continue the string. Typing the

closing triple quotation marks ends the string, and the next ENTER will execute the command (in this case,

assigning the string to s).

2. The splitlines() method takes one multiline string and returns a list of strings, one for each line of the

original. Note that the line endings at the end of each line are not included.

3. The lower() method converts the entire string to lowercase. (Similarly, the upper() method converts a

string to uppercase.)

4. The count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that

sentence!

Here’s another common case. Let’s say you have a list of key-value pairs in the form

key1=value1&key2=value2, and you want to split them up and make a dictionary of the form {key1:

value1, key2: value2}.

>>> query = 'user=pilgrim&database=master&password=PapayaWhip'

>>> a_list = query.split('&')

>>> a_list

['user=pilgrim', 'database=master', 'password=PapayaWhip']

>>> a_list_of_lists = [v.split('=', 1) for v in a_list if '=' in v]

>>> a_list_of_lists

[['user', 'pilgrim'], ['database', 'master'], ['password', 'PapayaWhip']]

>>> a_dict = dict(a_list_of_lists)

>>> a_dict

{'password': 'PapayaWhip', 'user': 'pilgrim', 'database': 'master'}

1. The split() string method takes a delimiter as its first argument (with no delimiter, it splits on whitespace). The method splits a string into a list of

strings based on the delimiter. Here, the delimiter is an ampersand character, but it could be anything.

2. Now we have a list of strings, each with a key, followed by an equals sign, followed by a value. We can use

a list comprehension to iterate over the entire list and split each string into two strings based on the first equals sign. The optional second argument to the split() method is the number of times you want to split.

1 means “only split once,” so the split() method will return a two-item list. (In theory, a value could

contain an equals sign too. If you just used 'key=value=foo'.split('='), you would end up with a three-

item list ['key', 'value', 'foo'].)

3. Finally, Python can turn that list-of-lists into a dictionary simply by passing it to the dict() function.

☞ The previous example looks a lot like parsing query parameters in a URL, but real-life

URL parsing is actually more complicated than this. If you’re dealing with URL query

parameters, you’re better off using the urllib.parse.parse_qs() function, which handles some non-obvious edge cases.
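For comparison, here is what urllib.parse.parse_qs() does with the same query string, plus a second, made-up query string that shows two of those edge cases: a key that appears twice, and a value with percent-encoded and plus-encoded characters.

```python
from urllib.parse import parse_qs

query = 'user=pilgrim&database=master&password=PapayaWhip'

# Every value comes back as a list, because a key may legally repeat.
params = parse_qs(query)

# A hypothetical query with a repeated key and an encoded value.
tricky = parse_qs('tag=a&tag=b&q=caf%C3%A9+au+lait')

print(params)
print(tricky)
```

The hand-rolled split() approach would keep only the last tag and would leave %C3%A9 and the plus signs undecoded.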

4.5.1. SLICING A STRING

Once you’ve defined a string, you can get any part of it as a new string. This is called slicing the string. Slicing

strings works exactly the same as slicing lists, which makes sense, because strings are just sequences of characters.

>>> a_string = 'My alphabet starts where your alphabet ends.'

>>> a_string[3:11]

'alphabet'

>>> a_string[3:-3]

'alphabet starts where your alphabet en'

>>> a_string[0:2]

'My'

>>> a_string[:18]

'My alphabet starts'

>>> a_string[18:]

' where your alphabet ends.'

1. You can get a part of a string, called a “slice”, by specifying two indices. The return value is a new string

containing all the characters of the string, in order, starting with the first slice index, up to but not including the second slice index.

2. Like slicing lists, you can use negative indices to slice strings.

3. Strings are zero-based, so a_string[0:2] returns the first two items of the string, starting at a_string[0],

up to but not including a_string[2].

4. If the left slice index is 0, you can leave it out, and 0 is implied. So a_string[:18] is the same as

a_string[0:18], because the starting 0 is implied.

5. Similarly, if the right slice index is the length of the string, you can leave it out. So a_string[18:] is the

same as a_string[18:44], because this string has 44 characters. There is a pleasing symmetry here. In this

44-character string, a_string[:18] returns the first 18 characters, and a_string[18:] returns everything

but the first 18 characters. In fact, a_string[:n] will always return the first n characters, and a_string[n:]

will return the rest, regardless of the length of the string.
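That symmetry can be checked for every possible split point. The sketch below also tries an index well past the end of the string, where slicing quietly clamps the index instead of raising an error.

```python
a_string = 'My alphabet starts where your alphabet ends.'

# Every split point, from 0 through len(a_string), rebuilds the original.
all_rejoin = all(a_string[:n] + a_string[n:] == a_string
                 for n in range(len(a_string) + 1))

# Out-of-range slice indices are clamped to the length of the string.
head, tail = a_string[:100], a_string[100:]

print(all_rejoin, len(head), repr(tail))
```

This is different from index notation: a_string[100] would raise an IndexError, but a_string[100:] is just an empty string.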

4.6. STRINGS VS. BYTES

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a

string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

>>> by = b'abcd\x65'

>>> by

b'abcde'

>>> type(by)

<class 'bytes'>

>>> len(by)

5

>>> by += b'\xff'

>>> by

b'abcde\xff'

>>> len(by)

6

>>> by[0]

97

>>> by[0] = 102

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

TypeError: 'bytes' object does not support item assignment

1. To define a bytes object, use the b'' “byte literal” syntax. Each byte within the byte literal can be an ASCII

character or an encoded hexadecimal number from \x00 to \xff (0–255).

2. The type of a bytes object is bytes.

3. Just like lists and strings, you can get the length of a bytes object with the built-in len() function.

4. Just like lists and strings, you can use the + operator to concatenate bytes objects. The result is a new

bytes object.

5. Concatenating a 5-byte bytes object and a 1-byte bytes object gives you a 6-byte bytes object.

6. Just like lists and strings, you can use index notation to get individual bytes in a bytes object. The items of a

string are strings; the items of a bytes object are integers. Specifically, integers between 0–255.

7. A bytes object is immutable; you cannot assign individual bytes. If you need to change individual bytes, you

can either use slicing and concatenation operators (which work the same as they do on strings), or you can convert the bytes object into a bytearray object.
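
As a rough illustration of the first option in note 7, a new bytes object can be assembled from slices of the old one. The function name replace_byte is made up for this sketch, and it assumes a non-negative index that is within range.

```python
def replace_byte(data, index, value):
    """Return a new bytes object with the byte at index replaced by value.

    Assumes 0 <= index < len(data) and 0 <= value <= 255.
    """
    # bytes([value]) builds a one-byte bytes object from an integer.
    return data[:index] + bytes([value]) + data[index + 1:]

by = b'abcde'
new_by = replace_byte(by, 0, 102)
print(new_by)   # b'fbcde'
print(by)       # b'abcde' (the original object is unchanged)
```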

>>> by = b'abcd\x65'

>>> barr = bytearray(by)

>>> barr

bytearray(b'abcde')

>>> len(barr)

5

>>> barr[0] = 102

>>> barr

bytearray(b'fbcde')

1. To convert a bytes object into a mutable bytearray object, use the built-in bytearray() function.

2. All the methods and operations you can do on a bytes object, you can do on a bytearray object too.

3. The one difference is that, with the bytearray object, you can assign individual bytes using index notation.

The assigned value must be an integer between 0–255.

The one thing you can never do is mix bytes and strings.

>>> by = b'd'

>>> s = 'abcde'

>>> by + s

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

TypeError: can't concat bytes to str

>>> s.count(by)

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

TypeError: Can't convert 'bytes' object to str implicitly

>>> s.count(by.decode('ascii'))

1

1. You can’t concatenate bytes and strings. They are two different data types.

2. You can’t count the occurrences of bytes in a string, because there are no bytes in a string. A string is a

sequence of characters. Perhaps you meant “count the occurrences of the string that you would get after

decoding this sequence of bytes in a particular character encoding”? Well then, you’ll need to say that

explicitly. Python 3 won’t implicitly convert bytes to strings or strings to bytes.

3. By an amazing coincidence, this line of code says “count the occurrences of the string that you would get

after decoding this sequence of bytes in this particular character encoding.”

And here is the link between strings and bytes: bytes objects have a decode() method that takes a

character encoding and returns a string, and strings have an encode() method that takes a character

encoding and returns a bytes object. In the previous example, the decoding was relatively

straightforward — converting a sequence of bytes in the ASCII encoding into a string of characters. But the

same process works with any encoding that supports the characters of the string — even legacy (non-

Unicode) encodings.

>>> a_string = '深入 Python'

>>> len(a_string)

9

>>> by = a_string.encode('utf-8')

>>> by

b'\xe6\xb7\xb1\xe5\x85\xa5 Python'

>>> len(by)

13

>>> by = a_string.encode('gb18030')

>>> by

b'\xc9\xee\xc8\xeb Python'

>>> len(by)

11

>>> by = a_string.encode('big5')

>>> by

b'\xb2`\xa4J Python'

>>> len(by)

11

>>> roundtrip = by.decode('big5')

>>> roundtrip

'深入 Python'

>>> a_string == roundtrip

True

1. This is a string. It has nine characters.

2. This is a bytes object. It has 13 bytes. It is the sequence of bytes you get when you take a_string and

encode it in UTF-8.

3. This is a bytes object. It has 11 bytes. It is the sequence of bytes you get when you take a_string and

encode it in GB18030.

4. This is a bytes object. It has 11 bytes. It is an entirely different sequence of bytes that you get when you take

a_string and encode it in Big5.

5. This is a string. It has nine characters. It is the sequence of characters you get when you take by and decode

it using the Big5 encoding algorithm. It is identical to the original string.
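
The phrase “any encoding that supports the characters of the string” matters: an encoding that cannot represent a character raises UnicodeEncodeError instead of returning bytes. The sketch below shows one way to probe for that; try_encode is a hypothetical helper name, not a built-in.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

a_string = '深入 Python'
print(try_encode(a_string, 'utf-8'))   # b'\xe6\xb7\xb1\xe5\x85\xa5 Python'
print(try_encode(a_string, 'ascii'))   # None, because ASCII has no Chinese characters

# Decoding with the same encoding that produced the bytes returns the original string.
by = a_string.encode('big5')
assert by.decode('big5') == a_string
```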

4.7. POSTSCRIPT: CHARACTER ENCODING OF PYTHON SOURCE CODE

Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8.

☞ In Python 2, the default encoding for .py files was ASCII. In Python 3, the default

encoding is UTF-8.

If you would like to use a different encoding within your Python code, you can put an encoding declaration

on the first line of each file. This declaration defines a .py file to be windows-1252:

# -*- coding: windows-1252 -*-

Technically, the character encoding override can also be on the second line, if the first line is a UNIX-like

hash-bang command.

#!/usr/bin/python3

# -*- coding: windows-1252 -*-

For more information, consult PEP 263: Defining Python Source Code Encodings.
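
If you are curious which encoding Python would pick for a given piece of source code, the standard library's tokenize.detect_encoding() function applies these rules. The snippet below is only an illustrative sketch; the helper name declared_encoding and the sample source text are invented for the example.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    # detect_encoding() expects a readline-style callable that returns bytes.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_declaration = (
    b"#!/usr/bin/python3\n"
    b"# -*- coding: windows-1252 -*-\n"
    b"print('hello')\n"
)
without_declaration = b"print('hello')\n"

print(declared_encoding(with_declaration))      # windows-1252
print(declared_encoding(without_declaration))   # utf-8
```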

4.8. FURTHER READING

On Unicode in Python:

Python Unicode HOWTO

What’s New In Python 3: Text vs. Data Instead Of Unicode vs. 8-bit

PEP 261 explains how Python handles astral characters outside of the Basic Multilingual Plane (i.e. characters whose ordinal value is greater than 65535)

On Unicode in general:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and

Character Sets (No Excuses!)

On the Goodness of Unicode

On Character Strings

Characters vs. Bytes

On character encoding in other formats:

Character encoding in XML

Character encoding in HTML

On strings and string formatting:

string — Common string operations

Format String Syntax

Format Specification Mini-Language

PEP 3101: Advanced String Formatting

CHAPTER 5. REGULAR EXPRESSIONS

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two

problems.

Jamie Zawinski

5.1. DIVING IN

Getting a small bit of text out of a large block of text is a challenge. In Python, strings have methods

for searching and replacing: index(), find(), split(), count(), replace(), etc. But these methods are

limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring,

and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call

s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The

replace() and split() methods have the same limitations.
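
Here is roughly what the lower() workaround looks like in practice. This is a sketch under simple assumptions: find_ignore_case is a hypothetical helper, and it assumes text (such as plain ASCII) whose length does not change when lowercased.

```python
def find_ignore_case(haystack, needle):
    """Return the index of the first case-insensitive match, or -1 if there is none."""
    # Lowercase both sides so that the comparison ignores case.
    return haystack.lower().find(needle.lower())

s = '100 North Main Road'
print(s.find('ROAD'))                # -1, because find() is case-sensitive
print(find_ignore_case(s, 'ROAD'))   # 15
```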

If your goal can be accomplished with string methods, you should use them. They’re fast and simple and easy

to read, and there’s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of

different string functions with if statements to handle special cases, or if you’re chaining calls to split()

and join() to slice-and-dice your strings, you may need to move up to regular expressions.

Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text

with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code,

the result can end up being more readable than a hand-rolled solution that uses a long chain of string

functions. There are even ways of embedding comments within regular expressions, so you can include fine-

grained documentation within them.

☞ If you’ve used regular expressions in other languages (like Perl, JavaScript, or PHP),

Python’s syntax will be very familiar. Read the summary of the re module to get an overview of the available functions and their arguments.

5.2. CASE STUDY: STREET ADDRESSES

This series of examples was inspired by a real-life problem I had in my day job several years ago, when I

needed to scrub and standardize street addresses exported from a legacy system before importing them into

a newer system. (See, I don’t just make this stuff up; it’s actually useful.) This example shows how I

approached the problem.

>>> s = '100 NORTH MAIN ROAD'

>>> s.replace('ROAD', 'RD.')

'100 NORTH MAIN RD.'

>>> s = '100 NORTH BROAD ROAD'

>>> s.replace('ROAD', 'RD.')

'100 NORTH BRD. RD.'

>>> s[:-4] + s[-4:].replace('ROAD', 'RD.')

'100 NORTH BROAD RD.'

>>> import re

>>> re.sub('ROAD$', 'RD.', s)

'100 NORTH BROAD RD.'

1. My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I

thought this was simple enough that I could just use the string method replace(). After all, all the data was

already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a

constant. And in this deceptively simple example, s.replace() does indeed work.

2. Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that

'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word.

The replace() method sees these two occurrences and blindly replaces both of them; meanwhile, I see my

addresses getting destroyed.

3. To solve the problem of addresses with more than one 'ROAD' substring, you could resort to something like

this: only search and replace 'ROAD' in the last four characters of the address (s[-4:]), and leave the rest of the string

alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent

on the length of the string you’re replacing. (If you were replacing 'STREET' with 'ST.', you would need to

use s[:-6] and s[-6:].replace(...).) Would you like to come back in six months and debug this? I know

I wouldn’t.

4. It’s time to move up to regular expressions. In Python, all functionality related to regular expressions is

contained in the re module.

5. Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only

when it occurs at the end of a string. The $ means “end of the string.” (There is a corresponding character,

the caret ^, which means “beginning of the string.”) Using the re.sub() function, you search the string s for

the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s,

but does not match the ROAD that’s part of the word BROAD, because that’s in the middle of s.
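
One advantage of the regular expression over the slicing approach is that it no longer depends on the length of the word being replaced. The sketch below wraps the idea in a function; abbreviate_suffix is a made-up name, and re.escape() is added as a precaution in case the word contains characters that are special in regular expressions.

```python
import re

def abbreviate_suffix(address, word, abbreviation):
    """Replace word with abbreviation only when word ends the address."""
    # The trailing $ anchors the match to the end of the string.
    return re.sub(re.escape(word) + '$', abbreviation, address)

print(abbreviate_suffix('100 NORTH BROAD ROAD', 'ROAD', 'RD.'))      # 100 NORTH BROAD RD.
print(abbreviate_suffix('100 NORTH MAIN STREET', 'STREET', 'ST.'))   # 100 NORTH MAIN ST.
```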

Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.

^ matches the start of a string. $ matches the end of a string.

>>> s = '100 BROAD'

>>> re.sub('ROAD$', 'RD.', s)

'100 BRD.'

>>> re.sub('\\bROAD$', 'RD.', s)

'100 BROAD'

>>> re.sub(r'\bROAD$', 'RD.', s)

'100 BROAD'

>>> s = '100 BROAD ROAD APT. 3'

>>> re.sub(r'\bROAD$', 'RD.', s)

'100 BROAD ROAD APT. 3'

>>> re.sub(r'\bROAD\b', 'RD.', s)

'100 BROAD RD. APT. 3'

1. What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own word

(and not a part of some larger word). To express this in a regular expression, you use \b, which means “a

word boundary must occur right here.” In Python, this is complicated by the fact that the '\' character in a

string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why

regular expressions are easier in Perl than in Python. On the downside, Perl mixes regular expressions with

other syntax, so if you have a bug, it may be hard to tell whether it’s a bug in syntax or a bug in your

regular expression.

2. To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the

letter r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is

really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing

with regular expressions; otherwise, things get too confusing too quickly (and regular expressions are

confusing enough already).

3. *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address

contained the word 'ROAD' as a whole word by itself, but it wasn’t at the end, because the address had an

apartment number after the street designation. Because 'ROAD' isn’t at the very end of the string, it doesn’t

match, so the entire call to re.sub() ends up replacing nothing at all, and you get the original string back,

which is not what you want.

4. To solve this problem, I removed the $ character and added another \b. Now the regular expression reads

“match 'ROAD' when it’s a whole word by itself anywhere in the string,” whether at the end, the beginning,

or somewhere in the middle.
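
To tie the notes together, the following sketch first checks the claim about raw strings, then applies the final pattern to several addresses. The function name abbreviate_road and the sample addresses are just for illustration.

```python
import re

# '\b' in an ordinary string is a single backspace character, while
# '\\b' and r'\b' are both a backslash followed by the letter b.
assert len('\b') == 1
assert '\\bROAD\\b' == r'\bROAD\b'

def abbreviate_road(address):
    """Replace the whole word ROAD with RD. anywhere in the address."""
    return re.sub(r'\bROAD\b', 'RD.', address)

for address in ('100 NORTH MAIN ROAD', '100 BROAD', '100 BROAD ROAD APT. 3'):
    print(abbreviate_road(address))
# 100 NORTH MAIN RD.
# 100 BROAD
# 100 BROAD RD. APT. 3
```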

5.3. CASE STUDY: ROMAN NUMERALS

You’ve most likely seen Roman numerals, even if you didn’t recognize them. You may have seen them in

copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the

dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You

may also have seen them in outlines and bibliographical references. It’s a system of representing numbers

that really does date back to the ancient Roman empire (hence the name).

In Roman numerals, there are seven characters that are repeated and combined in various ways to represent

numbers.

• I = 1

• V = 5

• X = 10

• L = 50

• C = 100

• D = 500

• M = 1000

The following are some general rules for constructing Roman numerals:

• Sometimes characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and

VIII is 8.

• The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the

next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”).

40 is written as XL (“10 less than 50”), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (“10 less

than 50, then 1 less than 5”).

• Sometimes characters are… the opposite of additive. By putting certain characters before others, you

subtract from the final value. For example, at 9, you need to subtract from the next highest tens character: 8

is VIII, but 9 is IX (“1 less than 10”), not VIIII (since the I character can not be repeated four times). 90

is XC, 900 is CM.

• The fives characters can not be repeated. 10 is always represented as X, never as VV. 100 is always C, never

LL.

• Roman numerals are read left to right, so the order of characters matters very much. DC is 600; CD is a

completely different number (400, “100 less than 500”). CI is 101; IC is not even a valid Roman numeral

(because you can't subtract 1 directly from 100; you would need to write it as XCIX, “10 less than 100, then

1 less than 10”).
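
The rules above can be captured in a small lookup table. What follows is one possible sketch, not the only way to do it: the names ROMAN_NUMERAL_TABLE and to_roman are chosen for this example, and the 1 to 3999 limit is an assumption that follows from M being repeatable at most three times.

```python
# Each pair is (value, numeral), ordered from largest to smallest.
# The subtractive forms (CM, CD, XC, XL, IX, IV) get their own entries.
ROMAN_NUMERAL_TABLE = (
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
)

def to_roman(n):
    """Convert an integer from 1 to 3999 into a Roman numeral string."""
    if not 0 < n < 4000:
        raise ValueError('number out of range (must be 1..3999)')
    result = ''
    for value, numeral in ROMAN_NUMERAL_TABLE:
        # Append the largest numeral that still fits, as often as it fits.
        while n >= value:
            result += numeral
            n -= value
    return result

print(to_roman(1946))   # MCMXLVI
print(to_roman(1888))   # MDCCCLXXXVIII
print(to_roman(44))     # XLIV
print(to_roman(99))     # XCIX
```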

5.3.1. CHECKING FOR THOUSANDS

What would it take to validate that an arbitrary string is a valid Roman numeral? Let’s take it one digit at a

time. Since Roman numerals are always written highest to lowest, let’s start with the highest: the thousands

place. For numbers 1000 and higher, the thousands are represented by a series of M characters.

>>> import re

>>> pattern = '^M?M?M?$'

>>> re.search(pattern, 'M')

<_sre.SRE_Match object at 0106FB58>

>>> re.search(pattern, 'MM')

<_sre.SRE_Match object at 0106C290>

>>> re.search(pattern, 'MMM')

<_sre.SRE_Match object at 0106AA38>

>>> re.search(pattern, 'MMMM')

>>> re.search(pattern, '')

<_sre.SRE_Match object at 0106F4A8>

1. This pattern has three parts. ^ matches what follows only at the beginning of the string. If this were not

specified, the pattern would match no matter where the M characters were, which is not what you want.

You want to make sure that the M characters, if they’re there, are at the beginning of the string. M?

optionally matches a single M character. Since this is repeated three times, you’re matching anywhere from

zero to three M characters in a row. And $ matches the end of the string. When combined with the ^

character at the beginning, this means that the pattern must match the entire string, with no other

characters before or after the M characters.

2. The essence of the re module is the search() function, which takes a regular expression (pattern) and a

string ('M') to try to match against the regular expression. If a match is found, search() returns an object

which has various methods to describe the match; if no match is found, search() returns None, the Python

null value. All you care about at the moment is whether the pattern matches, which you can tell by just

looking at the return value of search(). 'M' matches this regular expression, because the first optional M

matches and the second and third optional M characters are ignored.

3. 'MM' matches because the first and second optional M characters match and the third M is ignored.

4. 'MMM' matches because all three M characters match.

5. 'MMMM' does not match. All three M characters match, but then the regular expression insists on the string

ending (because of the $ character), and the string doesn’t end yet (because of the fourth M). So search()

returns None.

6. Interestingly, an empty string also matches this regular expression, since all the M characters are optional.
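
Since all you care about is whether search() returned a match object or None, the check is easy to wrap in a function that returns True or False. This is a small sketch; THOUSANDS_PATTERN and matches_thousands are names invented for the example.

```python
import re

THOUSANDS_PATTERN = '^M?M?M?$'

def matches_thousands(s):
    """Return True if the thousands pattern matches s, False otherwise."""
    # search() returns a match object on success and None on failure.
    return re.search(THOUSANDS_PATTERN, s) is not None

for candidate in ('M', 'MM', 'MMM', 'MMMM', ''):
    print(repr(candidate), matches_thousands(candidate))
# 'M' True
# 'MM' True
# 'MMM' True
# 'MMMM' False
# '' True
```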

5.3.2. CHECKING FOR HUNDREDS

The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.

• 100 = C

• 200 = CC

• 300 = CCC

• 400 = CD

• 500 = D

• 600 = DC

• 700 = DCC

• 800 = DCCC

• 900 = CM

? makes a pattern optional.

So there are four possible patterns:

• CM

• CD

• Zero to three C characters (zero if the hundreds place is 0)

• D, followed by zero to three C characters

The last two patterns can be combined:

• an optional D, followed by zero to three C characters

This example shows how to validate the hundreds place of a Roman numeral.

>>> import re

>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'

>>> re.search(pattern, 'MCM')

<_sre.SRE_Match object at 01070390>

>>> re.search(pattern, 'MD')

<_sre.SRE_Match object at 01073A50>

>>> re.search(pattern, 'MMMCCC')

<_sre.SRE_Match object at 010748A8>

>>> re.search(pattern, 'MCMC')

>>> re.search(pattern, '')