1. This pattern starts out the same as the previous one, checking for the beginning of the string (^), then the
thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually
exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero
to three optional C characters). The regular expression parser checks for each of these patterns in order
(from left to right), takes the first one that matches, and ignores the rest.
2. 'MCM' matches because the first M matches, the second and third M characters are ignored, and the CM
matches (so the CD and D?C?C?C? patterns are never even considered). MCM is the Roman numeral
representation of 1900.
3. 'MD' matches because the first M matches, the second and third M characters are ignored, and the D?C?C?C?
pattern matches D (each of the three C characters are optional and are ignored). MD is the Roman numeral
representation of 1500.
4. 'MMMCCC' matches because all three M characters match, and the D?C?C?C? pattern matches CCC (the D is
optional and is ignored). MMMCCC is the Roman numeral representation of 3300.
5. 'MCMC' does not match. The first M matches, the second and third M characters are ignored, and the CM
matches, but then the $ does not match because you’re not at the end of the string yet (you still have an
unmatched C character). The C does not match as part of the D?C?C?C? pattern, because the mutually
exclusive CM pattern has already matched.
6. Interestingly, an empty string still matches this pattern, because all the M characters are optional and ignored,
and the empty string matches the D?C?C?C? pattern where all the characters are optional and ignored.
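The six cases above can be verified directly in the interpreter. This is a minimal sketch using the same thousands-and-hundreds pattern built up in this section:

```python
import re

# Thousands (M?M?M?) plus hundreds: 900 (CM), 400 (CD),
# or 0-800 expressed as an optional D and up to three optional Cs.
pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'

assert re.search(pattern, 'MCM') is not None     # 1900
assert re.search(pattern, 'MD') is not None      # 1500
assert re.search(pattern, 'MMMCCC') is not None  # 3300
assert re.search(pattern, 'MCMC') is None        # leftover C, no match
assert re.search(pattern, '') is not None        # empty string still matches!
```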
Whew! See how quickly regular expressions can get nasty? And you’ve only covered the thousands and
hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because
they’re exactly the same pattern. But let’s look at another way to express the pattern.
⁂
5.4. USING THE {n,m} SYNTAX
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.

{1,4} matches between 1 and 4 occurrences of a pattern.

>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')        ①
<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM')       ②
<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MMM')      ③
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMMM')     ④
>>>
1. This matches the start of the string, and then the first optional M, but not the second and third M (but that’s
okay because they’re optional), and then the end of the string.
2. This matches the start of the string, and then the first and second optional M, but not the third M (but that’s
okay because it’s optional), and then the end of the string.
3. This matches the start of the string, and then all three optional M, and then the end of the string.
4. This matches the start of the string, and then all three optional M, but then does not match the end of the
string (because there is still one unmatched M), so the pattern does not match and returns None.
>>> pattern = '^M{0,3}$'           ①
>>> re.search(pattern, 'M')        ②
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM')       ③
<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM')      ④
<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM')     ⑤
>>>
1. This pattern says: “Match the start of the string, then anywhere from zero to three M characters, then the
end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than
three M characters, you could say M{1,3}.
2. This matches the start of the string, then one M out of a possible three, then the end of the string.
3. This matches the start of the string, then two M out of a possible three, then the end of the string.
4. This matches the start of the string, then three M out of a possible three, then the end of the string.
5. This matches the start of the string, then three M out of a possible three, but then does not match the end of
the string. The regular expression allows for up to only three M characters before the end of the string, but
you have four, so the pattern does not match and returns None.
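The M{1,3} variation mentioned in point 1 behaves as described; here is a quick sketch confirming it requires at least one M:

```python
import re

# M{1,3}: at least one, at most three Ms (the variation on M{0,3}
# mentioned above, shown here for comparison).
pattern = '^M{1,3}$'

assert re.search(pattern, '') is None        # zero Ms no longer allowed
assert re.search(pattern, 'M') is not None
assert re.search(pattern, 'MMM') is not None
assert re.search(pattern, 'MMMM') is None    # four is still too many
```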
5.4.1. CHECKING FOR TENS AND ONES
Now let’s expand the Roman numeral regular expression to cover the tens and ones place. This example
shows the check for tens.
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL')       ①
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCML')        ②
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLX')       ③
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXX')     ④
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXX')    ⑤
>>>
1. This matches the start of the string, then the first optional M, then CM, then XL, then the end of the string.
Remember, the (A|B|C) syntax means “match exactly one of A, B, or C”. You match XL, so you ignore the
XC and L?X?X?X? choices, and then move on to the end of the string. MCMXL is the Roman numeral
representation of 1940.
2. This matches the start of the string, then the first optional M, then CM, then L?X?X?X?. Of the L?X?X?X?, it
matches the L and skips all three optional X characters. Then you move to the end of the string. MCML is the
Roman numeral representation of 1950.
3. This matches the start of the string, then the first optional M, then CM, then the optional L and the first
optional X, skips the second and third optional X, then the end of the string. MCMLX is the Roman numeral
representation of 1960.
4. This matches the start of the string, then the first optional M, then CM, then the optional L and all three
optional X characters, then the end of the string. MCMLXXX is the Roman numeral representation of 1980.
5. This matches the start of the string, then the first optional M, then CM, then the optional L and all three
optional X characters, then fails to match the end of the string because there is still one more X unaccounted
for. So the entire pattern fails to match, and returns None. MCMLXXXX is not a valid Roman numeral.
The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.

(A|B) matches either pattern A or pattern B, but not both.

>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'

So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.

>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')               ①
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI')           ②
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII')    ③
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'I')                  ④
<_sre.SRE_Match object at 0x008EEB48>
1. This matches the start of the string, then one of a possible three M characters, then D?C{0,3}. Of that, it
matches the optional D and zero of three possible C characters. Moving on, it matches L?X{0,3} by matching
the optional L and zero of three possible X characters. Then it matches V?I{0,3} by matching the optional V
and zero of three possible I characters, and finally the end of the string. MDLV is the Roman numeral
representation of 1555.
2. This matches the start of the string, then two of a possible three M characters, then the D?C{0,3} with a D
and one of three possible C characters; then L?X{0,3} with an L and one of three possible X characters;
then V?I{0,3} with a V and one of three possible I characters; then the end of the string. MMDCLXVI is the
Roman numeral representation of 2666.
3. This matches the start of the string, then three out of three M characters, then D?C{0,3} with a D and three
out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3}
with a V and three out of three I characters; then the end of the string. MMMDCCCLXXXVIII is the Roman
numeral representation of 3888, and it’s the longest Roman numeral you can write without extended syntax.
4. Watch closely. (I feel like a magician. “Watch closely, kids, I’m going to pull a rabbit out of my hat.”) This
matches the start of the string, then zero out of three M, then matches D?C{0,3} by skipping the optional D
and matching zero out of three C, then matches L?X{0,3} by skipping the optional L and matching zero out
of three X, then matches V?I{0,3} by skipping the optional V and matching one out of three I. Then the
end of the string. Whoa.
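All four cases above, plus the empty-string quirk from earlier, can be checked with a small convenience wrapper. The function name is my own, not the chapter's:

```python
import re

# A hypothetical helper wrapping the full {n,m} pattern in a yes/no check.
def is_valid_roman_numeral(s):
    pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
    return re.search(pattern, s) is not None

assert is_valid_roman_numeral('MDLV')             # 1555
assert is_valid_roman_numeral('MMMDCCCLXXXVIII')  # 3888
assert is_valid_roman_numeral('I')                # 1
assert not is_valid_roman_numeral('MCMC')         # leftover C, invalid
assert is_valid_roman_numeral('')                 # the empty-string quirk remains
```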
If you followed all that and understood it on the first try, you’re doing better than I did. Now imagine trying
to understand someone else’s regular expressions, in the middle of a critical function of a large program. Or
even imagine coming back to your own regular expressions a few months later. I’ve done it, and it’s not a
pretty sight.
Now let’s explore an alternate syntax that can help keep your expressions maintainable.
⁂
5.5. VERBOSE REGULAR EXPRESSIONS
So far you’ve just been dealing with what I’ll call “compact” regular expressions. As you’ve seen, they are
difficult to read, and even if you figure out what one does, that’s no guarantee that you’ll be able to
understand it six months later. What you really need is inline documentation.
Python allows you to do this with something called verbose regular expressions. A verbose regular expression
is different from a compact regular expression in two ways:
• Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage
returns. They’re not matched at all. (If you want to match a space in a verbose regular expression, you’ll
need to escape it by putting a backslash in front of it.)
• Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code:
it starts with a # character and goes until the end of the line. In this case it’s a comment within a multi-line
string instead of within your source code, but it works the same way.
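To illustrate the first bullet, here is a contrived two-word pattern of my own showing how a backslash-escaped space is matched literally in verbose mode while unescaped whitespace is ignored:

```python
import re

# In verbose mode, unescaped whitespace and # comments are ignored;
# a backslash-escaped space is matched as a literal space.
pattern = r'''
    ^hello      # the first word
    \ world$    # an escaped space, then the second word
'''
assert re.search(pattern, 'hello world', re.VERBOSE) is not None
assert re.search(pattern, 'helloworld', re.VERBOSE) is None
```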
This will be more clear with an example. Let’s revisit the compact regular expression you’ve been working
with, and make it a verbose regular expression. This example shows how.
>>> pattern = '''
^ # beginning of string
M{0,3} # thousands - 0 to 3 Ms
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 Cs),
# or 500-800 (D, followed by 0 to 3 Cs)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 Xs),
# or 50-80 (L, followed by 0 to 3 Xs)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 Is),
# or 5-8 (V, followed by 0 to 3 Is)
$ # end of string
'''
>>> re.search(pattern, 'M', re.VERBOSE)                ①
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)        ②
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE)  ③
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M')                            ④
1. The most important thing to remember when using verbose regular expressions is that you need to pass an
extra argument when working with them: re.VERBOSE is a constant defined in the re module that signals
that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a
bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore
the whitespace and the comments, this is exactly the same regular expression as you saw in the previous
section, but it’s a lot more readable.
2. This matches the start of the string, then one of a possible three M, then CM, then L and three of a possible
three X, then IX, then the end of the string.
3. This matches the start of the string, then three of a possible three M, then D and three of a possible three C,
then L and three of a possible three X, then V and three of a possible three I, then the end of the string.
4. This does not match. Why? Because it doesn’t have the re.VERBOSE flag, so the re.search function is
treating the pattern as a compact regular expression, with significant whitespace and literal hash marks.
Python can’t auto-detect whether a regular expression is verbose or not. Python assumes every regular
expression is compact unless you explicitly state that it is verbose.
⁂
5.6. CASE STUDY: PARSING PHONE NUMBERS
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.

\d matches any numeric digit (0–9). \D matches anything but digits.

This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.

Here are the phone numbers I needed to be able to accept:
• 800-555-1212
• 800 555 1212
• 800.555.1212
• (800) 555-1212
• 1-800-555-1212
• 800-555-1212-1234
• 800-555-1212x1234
• 800-555-1212 ext. 1234
• work 1-(800) 555.1212 #1234
Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and
the rest of the phone number was 1212. For those with an extension, I need to know that the extension
was 1234.
Let’s work through developing a solution for phone number parsing. This example shows the first step.
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')  ①
>>> phonePattern.search('800-555-1212').groups()             ②
('800', '555', '1212')
>>> phonePattern.search('800-555-1212-1234')                 ③
>>> phonePattern.search('800-555-1212-1234').groups()        ④
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
1. Always read regular expressions from left to right. This one matches the beginning of the string, and then
(\d{3}). What’s \d{3}? Well, \d means “any numeric digit” (0 through 9). The {3} means “match exactly
three numeric digits”; it’s a variation on the {n,m} syntax you saw earlier. Putting it all in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then
match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then
another group of exactly four digits. Then match the end of the string.
2. To get access to the groups that the regular expression parser remembered along the way, use the
groups() method on the object that the search() method returns. It will return a tuple of however many
groups were defined in the regular expression. In this case, you defined three groups, one with three digits,
one with three digits, and one with four digits.
3. This regular expression is not the final answer, because it doesn’t handle a phone number with an extension
on the end. For that, you’ll need to expand the regular expression.
4. And this is why you should never “chain” the search() and groups() methods in production code. If the
search() method returns no matches, it returns None, not a regular expression match object. Calling None.groups() raises a perfectly obvious exception: None doesn’t have a groups() method. (Of course, it’s
slightly less obvious when you get this exception from deep within your code. Yes, I speak from experience
here.)
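One defensive idiom, a sketch of my own rather than the chapter's, is to keep the match object in a variable and test it before calling groups():

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

# Hold on to the match object and check for None instead of chaining
# search() and groups() directly.
def safe_parse(s):
    match = phonePattern.search(s)
    if match is None:
        return None          # no match; no exception either
    return match.groups()

assert safe_parse('800-555-1212') == ('800', '555', '1212')
assert safe_parse('800-555-1212-1234') is None
```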
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')  ①
>>> phonePattern.search('800-555-1212-1234').groups()              ②
('800', '555', '1212', '1234')
>>> phonePattern.search('800 555 1212 1234')                       ③
>>>
>>> phonePattern.search('800-555-1212')                            ④
>>>
1. This regular expression is almost identical to the previous one. Just as before, you match the beginning of
the string, then a remembered group of three digits, then a hyphen, then a remembered group of three
digits, then a hyphen, then a remembered group of four digits. What’s new is that you then match another
hyphen, and a remembered group of one or more digits, then the end of the string.
2. The groups() method now returns a tuple of four elements, since the regular expression now defines four
groups to remember.
3. Unfortunately, this regular expression is not the final answer either, because it assumes that the different
parts of the phone number are separated by hyphens. What if they’re separated by spaces, or commas, or
dots? You need a more general solution to match several different types of separators.
4. Oops! Not only does this regular expression not do everything you want, it’s actually a step backwards,
because now you can’t parse phone numbers without an extension. That’s not what you wanted at all; if the
extension is there, you want to know what it is, but if it’s not there, you still want to know what the
different parts of the main number are.
The next example shows the regular expression to handle separators between the different parts of the
phone number.
>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')  ①
>>> phonePattern.search('800 555 1212 1234').groups()                    ②
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212-1234').groups()                    ③
('800', '555', '1212', '1234')
>>> phonePattern.search('80055512121234')                                ④
>>>
>>> phonePattern.search('800-555-1212')                                  ⑤
>>>
1. Hang on to your hat. You’re matching the beginning of the string, then a group of three digits, then \D+.
What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So
\D+ matches one or more characters that are not digits. This is what you’re using instead of a literal hyphen,
to try to match different separators.
2. Using \D+ instead of - means you can now match phone numbers where the parts are separated by spaces
instead of hyphens.
3. Of course, phone numbers separated by hyphens still work too.
4. Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the
phone number is entered without any spaces or hyphens at all?
5. Oops! This still hasn’t fixed the problem of requiring extensions. Now you have two problems, but you can
solve both of them with the same technique.
The next example shows the regular expression for handling phone numbers without separators.
>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  ①
>>> phonePattern.search('80055512121234').groups()                       ②
('800', '555', '1212', '1234')
>>> phonePattern.search('800.555.1212 x1234').groups()                   ③
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()                         ④
('800', '555', '1212', '')
>>> phonePattern.search('(800)5551212 x1234')                            ⑤
>>>
1. The only change you’ve made since that last step is changing all the + to *. Instead of \D+ between the parts
of the phone number, you now match on \D*. Remember that + means “1 or more”? Well, * means “zero
or more”. So now you should be able to parse phone numbers even when there is no separator character
at all.
2. Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group
of three digits (800), then zero non-numeric characters, then a remembered group of three digits (555), then
zero non-numeric characters, then a remembered group of four digits (1212), then zero non-numeric
characters, then a remembered group of an arbitrary number of digits (1234), then the end of the string.
3. Other variations work now too: dots instead of hyphens, and both a space and an x before the extension.
4. Finally, you’ve solved the other long-standing problem: extensions are optional again. If no extension is found,
the groups() method still returns a tuple of four elements, but the fourth element is just an empty string.
5. I hate to be the bearer of bad news, but you’re not finished yet. What’s the problem here? There’s an extra
character before the area code, but the regular expression assumes that the area code is the first thing at
the beginning of the string. No problem, you can use the same technique of “zero or more non-numeric
characters” to skip over the leading characters before the area code.
The next example shows how to handle leading characters in phone numbers.
>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  ①
>>> phonePattern.search('(800)5551212 ext. 1234').groups()                  ②
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()                            ③
('800', '555', '1212', '')
>>> phonePattern.search('work 1-(800) 555.1212 #1234')                      ④
>>>
1. This is the same as in the previous example, except now you’re matching \D*, zero or more non-numeric
characters, before the first remembered group (the area code). Notice that you’re not remembering these
non-numeric characters (they’re not in parentheses). If you find them, you’ll just skip over them and then
start remembering the area code whenever you get to it.
2. You can successfully parse the phone number, even with the leading left parenthesis before the area code.
(The right parenthesis after the area code is already handled; it’s treated as a non-numeric separator and
matched by the \D* after the first remembered group.)
3. Just a sanity check to make sure you haven’t broken anything that used to work. Since the leading characters
are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a
remembered group of three digits (800), then one non-numeric character (the hyphen), then a remembered
group of three digits (555), then one non-numeric character (the hyphen), then a remembered group of four
digits (1212), then zero non-numeric characters, then a remembered group of zero digits, then the end of
the string.
4. This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn’t
this phone number match? Because there’s a 1 before the area code, but you assumed that all the leading
characters before the area code were non-numeric characters (\D*). Aargh.
Let’s back up for a second. So far the regular expressions have all matched from the beginning of the string.
But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you
want to ignore. Rather than trying to match it all just so you can skip over it, let’s take a different approach:
don’t explicitly match the beginning of the string at all. This approach is shown in the next example.
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  ①
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()         ②
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()                        ③
('800', '555', '1212', '')
>>> phonePattern.search('80055512121234').groups()                      ④
('800', '555', '1212', '1234')
1. Note the lack of ^ in this regular expression. You are not matching the beginning of the string anymore.
There’s nothing that says you need to match the entire input with your regular expression. The regular
expression engine will do the hard work of figuring out where the input string starts to match, and go from
there.
2. Now you can successfully parse a phone number that includes leading characters and a leading digit, plus any
number of any kind of separators around each part of the phone number.
3. Sanity check. This still works.
4. That still works too.
See how quickly a regular expression can get out of control? Take a quick glance at any of the previous
iterations. Can you tell the difference between one and the next?
While you still understand the final answer (and it is the final answer; if you’ve discovered a case it doesn’t
handle, I don’t want to know about it), let’s write it out as a verbose regular expression, before you forget
why you made the choices you made.
>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()  ①
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()                 ②
('800', '555', '1212', '')
1. Other than being spread out over multiple lines, this is exactly the same regular expression as the last step,
so it’s no surprise that it parses the same inputs.
2. Final sanity check. Yes, this still works. You’re done.
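If you want to reuse this final answer elsewhere, the verbose pattern can be wrapped in a small helper that returns None instead of raising when the input doesn't parse. The function name parse_phone is my own, not from the chapter:

```python
import re

phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

# A thin, hypothetical wrapper: check the match before touching groups().
def parse_phone(s):
    match = phonePattern.search(s)
    return match.groups() if match else None

assert parse_phone('work 1-(800) 555.1212 #1234') == ('800', '555', '1212', '1234')
assert parse_phone('800-555-1212') == ('800', '555', '1212', '')
assert parse_phone('hello world') is None
```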
⁂
5.7. SUMMARY
This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though
you’re completely overwhelmed by them now, believe me, you ain’t seen nothing yet.
You should now be familiar with the following techniques:
• ^ matches the beginning of a string.
• $ matches the end of a string.
• \b matches a word boundary.
• \d matches any numeric digit.
• \D matches any non-numeric character.
• x? matches an optional x character (in other words, it matches an x zero or one times).
• x* matches x zero or more times.
• x+ matches x one or more times.
• x{n,m} matches an x character at least n times, but not more than m times.
• (a|b|c) matches exactly one of a, b or c.
• (x) in general is a remembered group. You can get the value of what matched by using the groups() method
of the object returned by re.search.
Regular expressions are extremely powerful, but they are not the correct solution for every problem. You
should learn enough about them to know when they are appropriate, when they will solve your problems,
and when they will cause more problems than they solve.
CHAPTER 6. CLOSURES & GENERATORS
❝ My spelling is Wobbly. It’s good spelling but it Wobbles, and the letters get in the wrong places. ❞
— Winnie-the-Pooh
6.1. DIVING IN
Having grown up the son of a librarian and an English major, I have always been fascinated by
languages. Not programming languages. Well yes, programming languages, but also natural languages. Take
English. English is a schizophrenic language that borrows words from German, French, Spanish, and Latin (to
name a few). Actually, “borrows” is the wrong word; “pillages” is more like it. Or perhaps
“assimilates” — like the Borg. Yes, I like that.
We are the Borg. Your linguistic and etymological distinctiveness will be added to our own.
Resistance is futile.
In this chapter, you’re going to learn about plural nouns. Also, functions that return other functions,
advanced regular expressions, and generators. But first, let’s talk about how to make plural nouns. (If you
haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and it quickly descends into more advanced uses.)
If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably
familiar with the basic rules:
• If a word ends in S, X, or Z, add ES. Bass becomes basses, fax becomes faxes, and waltz becomes waltzes.
• If a word ends in a noisy H, add ES; if it ends in a silent H, just add S. What’s a noisy H? One that gets
combined with other letters to make a sound that you can hear. So coach becomes coaches and rash
becomes rashes, because you can hear the CH and SH sounds when you say them. But cheetah becomes
cheetahs, because the H is silent.
• If a word ends in Y that sounds like I, change the Y to IES; if the Y is combined with a vowel to sound like
something else, just add S. So vacancy becomes vacancies, but day becomes days.
• If all else fails, just add S and hope for the best.
(I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.)
Other languages, of course, are completely different.
Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules,
but keep in mind that you’ll inevitably need to add more.
⁂
6.2. I KNOW, LET’S USE REGULAR EXPRESSIONS!
So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You
have rules that say you need to find different combinations of characters, then do different things to them.
This sounds like a job for regular expressions!
import re

def plural(noun):
    if re.search('[sxz]$', noun):              ①
        return re.sub('$', 'es', noun)         ②
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'
1. This is a regular expression, but it uses a syntax you didn’t see in Regular Expressions. The square brackets mean “match exactly one of these characters.” So [sxz] means “s, or x, or z”, but only one of them. The $
should be familiar; it matches the end of string. Combined, this regular expression tests whether noun ends
with s, x, or z.
2. This re.sub() function performs regular expression-based string substitutions.
Let’s look at regular expression substitutions in more detail.
>>> import re
>>> re.search('[abc]', 'Mark')         ①
<_sre.SRE_Match object at 0x001C1FA8>
>>> re.sub('[abc]', 'o', 'Mark')       ②
'Mork'
>>> re.sub('[abc]', 'o', 'rock')       ③
'rook'
>>> re.sub('[abc]', 'o', 'caps')       ④
'oops'
1. Does the string Mark contain a, b, or c? Yes, it contains a.
2. OK, now find a, b, or c, and replace it with o. Mark becomes Mork.
3. The same function turns rock into rook.
4. You might think this would turn caps into oaps, but it doesn’t. re.sub replaces all of the matches, not just
the first one. So this regular expression turns caps into oops, because both the c and the a get turned into
o.
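If you ever do want only the first match replaced, re.sub accepts a standard count argument; this short sketch contrasts the two behaviors:

```python
import re

# By default re.sub replaces every match; count=1 limits it to the first.
assert re.sub('[abc]', 'o', 'caps') == 'oops'           # both c and a replaced
assert re.sub('[abc]', 'o', 'caps', count=1) == 'oaps'  # only the c replaced
```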
And now, back to the plural() function…
def plural(noun):
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun)             ①
    elif re.search('[^aeioudgkprt]h$', noun):      ②
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):            ③
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'
1. Here, you’re replacing the end of the string (matched by $) with the string es. In other words, adding es to
the string. You could accomplish the same thing with string concatenation, for example noun + 'es', but I
chose to use regular expressions for each rule, for reasons that will become clear later in the chapter.
2. Look closely, this is another new variation. The ^ as the first character inside the square brackets means
something special: negation. [^abc] means “any single character except a, b, or c”. So [^aeioudgkprt]
means any character except a, e, i, o, u, d, g, k, p, r, or t. Then that character needs to be followed by h,
followed by end of string. You’re looking for words that end in H where the H can be heard.
3. Same pattern here: match words that end in Y, where the character before the Y is not a, e, i, o, or u.
You’re looking for words that end in Y that sounds like I.
Let’s look at negation regular expressions in more detail.
>>> import re
>>> re.search('[^aeiou]y$', 'vacancy')     ①
<_sre.SRE_Match object at 0x001C1FA8>
>>> re.search('[^aeiou]y$', 'boy')         ②
>>>
>>> re.search('[^aeiou]y$', 'day')
>>>
>>> re.search('[^aeiou]y$', 'pita')        ③
>>>
1. vacancy matches this regular expression, because it ends in cy, and c is not a, e, i, o, or u.
2. boy does not match, because it ends in oy, and you specifically said that the character before the y could
not be o. day does not match, because it ends in ay.
3. pita does not match, because it does not end in y.
>>> re.sub('y$', 'ies', 'vacancy')               ①
'vacancies'
>>> re.sub('y$', 'ies', 'agency')
'agencies'
>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy')  ②
'vacancies'
1. This regular expression turns vacancy into vacancies and agency into agencies, which is what you
wanted. Note that it would also turn boy into boies, but that will never happen in the function because you
did that re.search first to find out whether you should do this re.sub.
2. Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find
out if the rule applies, and another to actually apply it) into a single regular expression. Here’s what that
would look like. Most of it should look familiar: you’re using a remembered group, which you learned in
Case study: Parsing Phone Numbers. The group is used to remember the character before the letter y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered?
put it right here.” In this case, you remember the c before the y; when you do the substitution, you
substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can
use \2 and \3 and so on.)
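Here is a throwaway demonstration of multiple remembered groups (not part of the pluralization code): two groups are captured, then swapped in the substitution string.

```python
import re

# \1 and \2 refer back to the first and second remembered groups.
print(re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world'))   # 'world hello'
```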
Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful.
But combining the entire operation into one regular expression is also much harder to read, and it doesn’t
directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the
word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if
the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that.
⁂
6.3. A LIST OF FUNCTIONS
Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that,
otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another
part.
import re

def match_sxz(noun):
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return re.sub('$', 'es', noun)

def match_h(noun):
    return re.search('[^aeioudgkprt]h$', noun)

def apply_h(noun):
    return re.sub('$', 'es', noun)

def match_y(noun):                        ①
    return re.search('[^aeiou]y$', noun)

def apply_y(noun):                        ②
    return re.sub('y$', 'ies', noun)

def match_default(noun):
    return True

def apply_default(noun):
    return noun + 's'

rules = ((match_sxz, apply_sxz),          ③
         (match_h, apply_h),
         (match_y, apply_y),
         (match_default, apply_default)
         )

def plural(noun):
    for matches_rule, apply_rule in rules:   ④
        if matches_rule(noun):
            return apply_rule(noun)
1. Now, each match rule is its own function which returns the results of calling the re.search() function.
2. Each apply rule is also its own function which calls the re.sub() function to apply the appropriate
pluralization rule.
3. Instead of having one function (plural()) with multiple rules, you have the rules data structure, which is a
sequence of pairs of functions.
4. Since the rules have been broken out into a separate data structure, the new plural() function can be
reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time
(one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will
get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far),
matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is
guaranteed to return something eventually, because the final match rule (match_default) simply returns
True, meaning the corresponding apply rule (apply_default) will always be applied.
The “rules” variable is a sequence of pairs of functions.

The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun).

If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following:
def plural(noun):
    if match_sxz(noun):
        return apply_sxz(noun)
    if match_h(noun):
        return apply_h(noun)
    if match_y(noun):
        return apply_y(noun)
    if match_default(noun):
        return apply_default(noun)
The benefit here is that the plural() function is now simplified. It takes a sequence of rules, defined
elsewhere, and iterates through them in a generic fashion.
1. Get a match rule
2. Does it match? Then call the apply rule and return the result.
3. No match? Go to step 1.
The rules could be defined anywhere, in any way. The plural() function doesn’t care.
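To see the rules structure in action, here is a condensed restatement of the same functions (only three of the four rules, to keep the sketch short) with a quick check:

```python
import re

# Condensed restatement of the rules structure above, with three of the
# four rules, to show how plural() walks the sequence.
def match_sxz(noun):
    return re.search('[sxz]$', noun)

def apply_sxz(noun):
    return re.sub('$', 'es', noun)

def match_y(noun):
    return re.search('[^aeiou]y$', noun)

def apply_y(noun):
    return re.sub('y$', 'ies', noun)

def match_default(noun):
    return True

def apply_default(noun):
    return noun + 's'

rules = ((match_sxz, apply_sxz),
         (match_y, apply_y),
         (match_default, apply_default))

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)

print(plural('box'))       # matched by match_sxz -> 'boxes'
print(plural('vacancy'))   # matched by match_y -> 'vacancies'
print(plural('bird'))      # falls through to the default rule -> 'birds'
```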
Now, was adding this level of abstraction worth it? Well, not yet. Let’s consider what it would take to add a
new rule to the function. In the first example, it would require adding an if statement to the plural()
function. In this second example, it would require adding two functions, match_foo() and apply_foo(), and
then updating the rules sequence to specify where in the order the new match and apply functions should
be called relative to the other rules.
But this is really just a stepping stone to the next section. Let’s move on…
⁂
6.4. A LIST OF PATTERNS
Defining separate named functions for each match and apply rule isn’t really necessary. You never call them
directly; you add them to the rules sequence and call them through there. Furthermore, each function
follows one of two patterns. All the match functions call re.search(), and all the apply functions call
re.sub(). Let’s factor out the patterns so that defining new rules can be easier.
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):                  ①
        return re.search(pattern, word)
    def apply_rule(word):                    ②
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)        ③
1. build_match_and_apply_functions() is a function that builds other functions dynamically. It takes
pattern, search and replace, then defines a matches_rule() function which calls re.search() with the
pattern that was passed to the build_match_and_apply_functions() function, and the word that was
passed to the matches_rule() function you’re building. Whoa.
2. Building the apply function works the same way. The apply function is a function that takes one parameter,
and calls re.sub() with the search and replace parameters that were passed to the
build_match_and_apply_functions() function, and the word that was passed to the apply_rule()
function you’re building. This technique of using the values of outside parameters within a dynamic function is
called closures. You’re essentially defining constants within the apply function you’re building: it takes one
parameter (word), but it then acts on that plus two other values (search and replace) which were set
when you defined the apply function.
3. Finally, the build_match_and_apply_functions() function returns a tuple of two values: the two functions
you just created. The constants you defined within those functions (pattern within the matches_rule()
function, and search and replace within the apply_rule() function) stay with those functions, even after
you return from build_match_and_apply_functions(). That’s insanely cool.
If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how
to use it.
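Before that, here is a smaller closure, unrelated to pluralization, that shows the same mechanism in isolation:

```python
def make_adder(n):
    # n is "baked in" to the returned function, just as pattern, search,
    # and replace are baked in to matches_rule() and apply_rule().
    def add(x):
        return x + n
    return add

add_three = make_adder(3)
print(add_three(10))   # 13: the 3 was remembered by the closure
```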
patterns = \                                 ①
  (
    ('[sxz]$',           '$',  'es'),
    ('[^aeioudgkprt]h$', '$',  'es'),
    ('(qu|[^aeiou])y$',  'y$', 'ies'),
    ('$',                '$',  's')          ②
  )
rules = [build_match_and_apply_functions(pattern, search, replace)   ③
         for (pattern, search, replace) in patterns]
1. Our pluralization “rules” are now defined as a tuple of tuples of strings (not functions). The first string in
each group is the regular expression pattern that you would use in re.search() to see if this rule matches.
The second and third strings in each group are the search and replace expressions you would use in
re.sub() to actually apply the rule to turn a noun into its plural.
2. There’s a slight change here, in the fallback rule. In the previous example, the match_default() function
simply returned True, meaning that if none of the more specific rules matched, the code would simply add
an s to the end of the given word. This example does something functionally equivalent. The final regular
expression asks whether the word has an end ($ matches the end of a string). Of course, every string has
an end, even an empty string, so this expression always matches. Thus, it serves the same purpose as the
match_default() function that always returned True: it ensures that if no more specific rule matches, the
code adds an s to the end of the given word.
3. This line is magic. It takes the sequence of strings in patterns and turns them into a sequence of functions.
How? By “mapping” the strings to the build_match_and_apply_functions() function. That is, it takes each
triplet of strings and calls the build_match_and_apply_functions() function with those three strings as
arguments. The build_match_and_apply_functions() function returns a tuple of two functions. This means
that rules ends up being functionally equivalent to the previous example: a list of tuples, where each tuple is
a pair of functions. The first function is the match function that calls re.search(), and the second function
is the apply function that calls re.sub().
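If the list-comprehension syntax is unfamiliar, the same rules list can be built with an explicit for loop. This sketch repeats build_match_and_apply_functions() from above and abbreviates the patterns tuple to two rules:

```python
import re

def build_match_and_apply_functions(pattern, search, replace):
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

# Abbreviated patterns tuple, just for this demo.
patterns = (('[sxz]$', '$', 'es'),
            ('$',      '$', 's'))

# The list comprehension in the text is equivalent to this explicit loop:
rules = []
for (pattern, search, replace) in patterns:
    rules.append(build_match_and_apply_functions(pattern, search, replace))

matches_rule, apply_rule = rules[0]   # the [sxz]$ rule
print(bool(matches_rule('box')))      # True: 'box' ends in x
print(apply_rule('box'))              # 'boxes'
```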
Rounding out this version of the script is the main entry point, the plural() function.
def plural(noun):
    for matches_rule, apply_rule in rules:   ①
        if matches_rule(noun):
            return apply_rule(noun)
1. Since the rules list is the same as the previous example (really, it is), it should come as no surprise that the
plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them
in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate
named functions. Now they are built dynamically by mapping the output of the
build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural()
function still works the same way.
⁂
6.5. A FILE OF PATTERNS
You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are
defined in a list of strings. The next logical step is to take these strings and put them in a separate file,
where they can be maintained separately from the code that uses them.
First, let’s create a text file that contains the rules you want. No fancy data structures, just whitespace-
delimited strings in three columns. Let’s call it plural4-rules.txt.
[sxz]$ $ es
[^aeioudgkprt]h$ $ es
[^aeiou]y$ y$ ies
$ $ s
Now let’s see how you can use this rules file.
import re

def build_match_and_apply_functions(pattern, search, replace):         ①
    def matches_rule(word):
        return re.search(pattern, word)
    def apply_rule(word):
        return re.sub(search, replace, word)
    return (matches_rule, apply_rule)

rules = []
with open('plural4-rules.txt', encoding='utf-8') as pattern_file:      ②
    for line in pattern_file:                                          ③
        pattern, search, replace = line.split(None, 3)                 ④
        rules.append(build_match_and_apply_functions(                  ⑤
                pattern, search, replace))
1. The build_match_and_apply_functions() function has not changed. You’re still using closures to build
two functions dynamically that use variables defined in the outer function.
2. The global open() function opens a file and returns a file object. In this case, the file we’re opening contains
the pattern strings for pluralizing nouns. The with statement creates what’s called a context: when the with
block ends, Python will automatically close the file, even if an exception is raised inside the with block.
You’ll learn more about with blocks and file objects in the Files chapter.
3. The for line in <fileobject> idiom reads data from the open file, one line at a time, and assigns the
text to the line variable. You’ll learn more about reading from files in the Files chapter.
4. Each line in the file really has three values, but they’re separated by whitespace (tabs or spaces, it makes no
difference). To split it out, use the split() string method. The first argument to the split() method is
None, which means “split on any whitespace (tabs or spaces, it makes no difference).” The second argument
is 3, which means “split on whitespace 3 times, then leave the rest of the line alone.” A line like [sxz]$ $
es will be broken up into the list ['[sxz]$', '$', 'es'], which means that pattern will get '[sxz]$',
search will get '$', and replace will get 'es'. That’s a lot of power in one little line of code.
5. Finally, you pass pattern, search, and replace to the build_match_and_apply_functions() function,
which returns a tuple of functions. You append this tuple to the rules list, and rules ends up storing the
list of match and apply functions that the plural() function expects.
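The split() behavior described in step 4 is easy to verify on its own:

```python
line = '[sxz]$ $ es\n'

# None means "split on any run of whitespace"; the trailing newline
# counts as whitespace and is discarded.
print(line.split(None, 3))         # ['[sxz]$', '$', 'es']

# The maxsplit argument matters when the rest of the line has spaces:
print('a b c d e'.split(None, 2))  # ['a', 'b', 'c d e']
```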
The improvement here is that you’ve completely separated the pluralization rules into an external file, so it
can be maintained separately from the code that uses it. Code is code, data is data, and life is good.
⁂
6.6. GENERATORS
Wouldn’t it be grand to have a generic plural() function that parses the rules file? Get rules, check for a
match, apply appropriate transformation, go to next rule. That’s all the plural() function has to do, and
that’s all the plural() function should do.
def rules(rules_filename):
    with open(rules_filename, encoding='utf-8') as pattern_file:
        for line in pattern_file:
            pattern, search, replace = line.split(None, 3)
            yield build_match_and_apply_functions(pattern, search, replace)

def plural(noun, rules_filename='plural5-rules.txt'):
    for matches_rule, apply_rule in rules(rules_filename):
        if matches_rule(noun):
            return apply_rule(noun)
    raise ValueError('no matching rule for {0}'.format(noun))
How the heck does that work? Let’s look at an interactive example first.
>>> def make_counter(x):
...     print('entering make_counter')
...     while True:
...         yield x                  ①
...         print('incrementing x')
...         x = x + 1
...
>>> counter = make_counter(2)        ②
>>> counter                          ③
<generator object at 0x001C9C10>
>>> next(counter)                    ④
entering make_counter
2
>>> next(counter)                    ⑤
incrementing x
3
>>> next(counter)                    ⑥
incrementing x
4
1. The presence of the yield keyword in make_counter means that this is not a normal function. It is a special
kind of function which generates values one at a time. You can think of it as a resumable function. Calling it
will return a generator that can be used to generate successive values of x.
2. To create an instance of the make_counter generator, just call it like any other function. Note that this does
not actually execute the function code. You can tell this because the first line of the make_counter()
function calls print(), but nothing has been printed yet.
3. The make_counter() function returns a generator object.
4. The next() function takes a generator object and returns its next value. The first time you call next() with
the counter generator, it executes the code in make_counter() up to the first yield statement, then
returns the value that was yielded. In this case, that will be 2, because you originally created the generator
by calling make_counter(2).
5. Repeatedly calling next() with the same generator object resumes exactly where it left off and continues
until it hits the next yield statement. All variables, local state, &c. are saved on yield and restored on
next(). The next line of code waiting to be executed calls print(), which prints incrementing x. After
that, the statement x = x + 1 executes. Then it loops through the while loop again, and the first thing it
hits is the statement yield x, which saves the state of everything and returns the current value of x (now 3).
6. The next call to next(counter) does all the same things again, but this time x is now 4.
Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep
incrementing x and spitting out values. But let’s look at more productive uses of generators instead.
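If you did want a bounded number of values from this infinite generator, itertools.islice from the standard library (not used in this chapter) can take a finite slice of it:

```python
import itertools

def make_counter(x):
    while True:
        yield x
        x = x + 1

# islice stops after five values, even though the generator never would.
print(list(itertools.islice(make_counter(2), 5)))   # [2, 3, 4, 5, 6]
```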
6.6.1. A FIBONACCI GENERATOR
def fib(max):
    a, b = 0, 1          ①
    while a < max:
        yield a          ②
        a, b = b, a + b  ③
“yield” pauses a function. “next()” resumes where it left off.

1. The Fibonacci sequence is a sequence of numbers where each number is the sum of the two numbers before it. It starts with 0 and 1, goes up slowly at first, then more and more rapidly. To start the sequence, you need two variables: a starts at 0, and b starts at 1.
2. a is the current number in the sequence, so yield it.
3. b is the next number in the sequence, so assign that to a, but also calculate the next value (a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b).

So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
>>> from fibonacci import fib
>>> for n in fib(1000):      ①
...     print(n, end=' ')    ②
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
>>> list(fib(1000))          ③
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
1. You can use a generator like fib() in a for loop directly. The for loop will automatically call the next()
function to get values from the fib() generator and assign them to the for loop index variable (n).
2. Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to
do is print it out. Once fib() runs out of numbers (a becomes bigger than max, which in this case is 1000),
then the for loop exits gracefully.
3. This is a useful idiom: pass a generator to the list() function, and it will iterate through the entire
generator (just like the for loop in the previous example) and return a list of all the values.
6.6.2. A PLURAL RULE GENERATOR
Let’s go back to plural5.py and see how this version of the plural() function works.
def rules(rules_filename):
    with open(rules_filename, encoding='utf-8') as pattern_file:
        for line in pattern_file:
            pattern, search, replace = line.split(None, 3)                   ①
            yield build_match_and_apply_functions(pattern, search, replace)  ②

def plural(noun, rules_filename='plural5-rules.txt'):
    for matches_rule, apply_rule in rules(rules_filename):                   ③
        if matches_rule(noun):
            return apply_rule(noun)
    raise ValueError('no matching rule for {0}'.format(noun))
1. No magic here. Remember that the lines of the rules file have three values separated by whitespace, so you
use line.split(None, 3) to get the three “columns” and assign them to three local variables.
2. And then you yield. What do you yield? Two functions, built dynamically with your old friend,
build_match_and_apply_functions(), which is identical to the previous examples. In other words,
rules() is a generator that spits out match and apply functions on demand.
3. Since rules() is a generator, you can use it directly in a for loop. The first time through the for loop, you
will call the rules() function, which will open the pattern file, read the first line, dynamically build a match
function and an apply function from the patterns on that line, and yield the dynamically built functions. The
second time through the for loop, you will pick up exactly where you left off in rules() (which was in the
middle of the for line in pattern_file loop). The first thing it will do is read the next line of the file
(which is still open), dynamically build another match and apply function based on the patterns on that line in
the file, and yield the two functions.
What have you gained over stage 4? Startup time. In stage 4, when you imported the plural4 module, it
read the entire patterns file and built a list of all the possible rules, before you could even think about calling
the plural() function. With generators, you can do everything lazily: you read the first rule and create
functions and try them, and if that works you don’t ever read the rest of the file or create any other
functions.
What have you lost? Performance! Every time you call the plural() function, the rules() generator starts
over from the beginning — which means re-opening the patterns file and reading from the beginning, one line
at a time.
What if you could have the best of both worlds: minimal startup cost (don’t execute any code on import),
and maximum performance (don’t build the same functions over and over again). Oh, and you still want to
keep the rules in a separate file (because code is code and data is data), just as long as you never have to
read the same line twice.
To do that, you’ll need to build your own iterator. But before you do that, you need to learn about Python
classes.
⁂
6.7. FURTHER READING
• Understanding Python’s “with” statement
• English Irregular Plural Nouns
CHAPTER 7. CLASSES & ITERATORS
❝ East is East, and West is West, and never the twain shall meet. ❞
7.1. DIVING IN
Iterators are the “secret sauce” of Python 3. They’re everywhere, underlying everything, always just out of sight. Comprehensions are just a simple form of iterators. Generators are just a simple form of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that.
Remember the Fibonacci generator? Here it is as a built-from-scratch iterator:
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''

    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a = 0
        self.b = 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib
Let’s take that one line at a time.
class Fib:
class? What’s a class?
⁂
7.2. DEFINING CLASSES
Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes,
and instantiate the classes you’ve defined.
Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define
the class and start coding. A Python class starts with the reserved word class, followed by the class name.
Technically, that’s all that’s required, since a class doesn’t need to inherit from any other class.
class PapayaWhip:  ①
    pass           ②
1. The name of this class is PapayaWhip, and it doesn’t inherit from any other class. Class names are usually
capitalized, EachWordLikeThis, but this is only a convention, not a requirement.
2. You probably guessed this, but everything in a class is indented, just like the code within a function, if
statement, for loop, or any other block of code. The first line not indented is outside the class.
This PapayaWhip class doesn’t define any methods or attributes, but syntactically, there needs to be
something in the definition, thus the pass statement. This is a Python reserved word that just means “move
along, nothing to see here”. It’s a statement that does nothing, and it’s a good placeholder when you’re
stubbing out functions or classes.
☞ The pass statement in Python is like an empty set of curly braces ({}) in Java or C.
Many classes are inherited from other classes, but this one is not. Many classes define methods, but this one
does not. There is nothing that a Python class absolutely must have, other than a name. In particular, C++
programmers may find it odd that Python classes don’t have explicit constructors and destructors. Although
it’s not required, Python classes can have something similar to a constructor: the __init__() method.
7.2.1. THE __init__() METHOD
This example shows the initialization of the Fib class using the __init__ method.
class Fib:
    '''iterator that yields numbers in the Fibonacci sequence'''  ①

    def __init__(self, max):                                      ②
1. Classes can (and should) have docstrings too, just like modules and functions.
2. The __init__() method is called immediately after an instance of the class is created. It would be
tempting — but technically incorrect — to call this the “constructor” of the class. It’s tempting, because it
looks like a C++ constructor (by convention, the __init__() method is the first method defined for the
class), acts like one (it’s the first piece of code executed in a newly created instance of the class), and even
sounds like one. Incorrect, because the object has already been constructed by the time the __init__()
method is called, and you already have a valid reference to the new instance of the class.
The first argument of every class method, including the __init__() method, is always a reference to the
current instance of the class. By convention, this argument is named self. This argument fills the role of the
reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming
convention. Nonetheless, please don’t call it anything but self; this is a very strong convention.
In the __init__() method, self refers to the newly created object; in other class methods, it refers to the
instance whose method was called. Although you need to specify self explicitly when defining the method,
you do not specify it when calling the method; Python will add it for you automatically.
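A minimal example (the Greeter class here is purely for illustration, not from this chapter) showing self in both roles:

```python
class Greeter:
    def __init__(self, name):
        # Here self refers to the instance being initialized.
        self.name = name

    def greet(self):
        # Here self refers to whichever instance the method was called on.
        return 'Hello, ' + self.name

g = Greeter('world')
print(g.greet())   # Python passes g as self automatically -> 'Hello, world'
```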
⁂
7.3. INSTANTIATING CLASSES
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a
function, passing the arguments that the __init__() method requires. The return value will be the newly
created object.
>>> import fibonacci2
>>> fib = fibonacci2.Fib(100)  ①
>>> fib                        ②
<fibonacci2.Fib object at 0x00DB8810>
>>> fib.__class__              ③
<class 'fibonacci2.Fib'>
>>> fib.__doc__                ④
'iterator that yields numbers in the Fibonacci sequence'
1. You are creating an instance of the Fib class (defined in the fibonacci2 module) and assigning the newly
created instance to the variable fib. You are passing one parameter, 100, which will end up as the max
argument in Fib’s __init__() method.
2. fib is now an instance of the Fib class.
3. Every class instance has a built-in attribute, __class__, which is the object’s class. Java programmers may be
familiar with the Class class, which contains methods like getName() and getSuperclass() to get metadata
information about an object. In Python, this kind of metadata is available through attributes, but the idea is
the same.
4. You can access the instance’s docstring just as with a function or a module. All instances of a class share
the same docstring.
☞ In Python, simply call a class as if it were a function to create a new instance of the
class. There is no explicit new operator like there is in C++ or Java.
⁂
7.4. INSTANCE VARIABLES
On to the next line:
class Fib:
    def __init__(self, max):
        self.max = max       ①
1. What is self.max? It’s an instance variable. It is completely separate from max, which was passed into the
__init__() method as an argument. self.max is “global” to the instance. That means that you can access it
from other methods.
class Fib:
    def __init__(self, max):
        self.max = max       ①
    .
    .
    .
    def __next__(self):
        fib = self.a
        if fib > self.max:   ②
1. self.max is defined in the __init__() method…
2. …and referenced in the __next__() method.
Instance variables are specific to one instance of a class. For example, if you create two Fib instances with
different maximum values, they will each remember their own values.
>>> import fibonacci2
>>> fib1 = fibonacci2.Fib(100)
>>> fib2 = fibonacci2.Fib(200)
>>> fib1.max
100
>>> fib2.max
200
⁂
7.5. A FIBONACCI ITERATOR
Now you’re ready to learn how to build an iterator. An iterator is just a class that defines an __iter__()
method.
class Fib:                                       ①
    def __init__(self, max):                     ②
        self.max = max

    def __iter__(self):                          ③
        self.a = 0
        self.b = 1
        return self

    def __next__(self):                          ④
        fib = self.a
        if fib > self.max:
            raise StopIteration                  ⑤
        self.a, self.b = self.b, self.a + self.b
        return fib                               ⑥

All three of these class methods, __init__, __iter__, and __next__, begin and end with a pair of underscore (_) characters. Why is that? There’s nothing magical about it, but it usually indicates that these are “special methods.” The only thing “special” about special methods is that they aren’t called directly; Python calls them when you use some other syntax on the class or an instance of the class. More about special methods later.

1. To build an iterator from scratch, Fib needs to be a class, not a function.
2. “Calling” Fib(max) is really creating an instance of this class and calling its __init__() method with max. The __init__() method saves the maximum value as an instance variable so other methods can refer to it later.
3. The __iter__() method is called whenever someone calls iter(fib). (As you’ll see in a minute, a for loop will call this automatically, but you can also call it yourself manually.) After performing beginning-of-iteration initialization (in this case, resetting self.a and self.b, our two counters), the __iter__() method can return any object that implements a __next__() method. In this case (and in most cases), __iter__() simply returns self, since this class implements its own __next__() method.
4. The __next__() method is called whenever someone calls next() on an iterator of an instance of a class.
That will make more sense in a minute.
5. When the __next__() method raises a StopIteration exception, this signals to the caller that the iteration
is exhausted. Unlike most exceptions, this is not an error; it’s a normal condition that just means that the
iterator has no more values to generate. If the caller is a for loop, it will notice this StopIteration
exception and gracefully exit the loop. (In other words, it will swallow the exception.) This little bit of magic
is actually the key to using iterators in for loops.
6. To spit out the next value, an iterator’s __next__() method simply returns the value. Do not use yield
here; that’s a bit of syntactic sugar that only applies when you’re using generators. Here you’re creating your
own iterator from scratch; use return instead.
Thoroughly confused yet? Excellent. Let’s see how to call this iterator:
>>> from fibonacci2 import Fib
>>> for n in Fib(1000):
...     print(n, end=' ')
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
Why, it’s exactly the same! Byte for byte identical to how you called Fibonacci-as-a-generator (modulo one capital letter). But how?
There’s a bit of magic involved in for loops. Here’s what happens:
• The for loop calls Fib(1000), as shown. This returns an instance of the Fib class. Call this fib_inst.
• Secretly, and quite cleverly, the for loop calls iter(fib_inst), which returns an iterator object. Call this
fib_iter. In this case, fib_iter == fib_inst, because the __iter__() method returns self, but the for
loop doesn’t know (or care) about that.
• To “loop through” the iterator, the for loop calls next(fib_iter), which calls the __next__() method on
the fib_iter object, which does the next-Fibonacci-number calculations and returns a value. The for loop
takes this value and assigns it to n, then executes the body of the for loop for that value of n.
• How does the for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a
StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception
will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the
__next__() method, of course!
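Those steps can be performed by hand. Here is the same Fib class as above, driven manually with iter() and next() instead of a for loop:

```python
class Fib:
    def __init__(self, max):
        self.max = max

    def __iter__(self):
        self.a, self.b = 0, 1
        return self

    def __next__(self):
        fib = self.a
        if fib > self.max:
            raise StopIteration
        self.a, self.b = self.b, self.a + self.b
        return fib

fib_iter = iter(Fib(5))   # calls Fib.__iter__(), which returns self
print(next(fib_iter))     # calls Fib.__next__() -> 0
print(next(fib_iter))     # -> 1
print(next(fib_iter))     # -> 1
try:
    while True:
        next(fib_iter)    # 2, 3, 5, then...
except StopIteration:
    print('iteration exhausted')   # ...the StopIteration a for loop would swallow
```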
⁂
7.6. A PLURAL RULE ITERATOR
Now it’s time for the finale. Let’s rewrite the plural rules generator as an iterator.

iter(f) calls f.__iter__. next(f) calls f.__next__.
class LazyRules:
    rules_filename = 'plural6-rules.txt'

    def __init__(self):
        self.pattern_file = open(self.rules_filename, encoding='utf-8')
        self.cache = []

    def __iter__(self):
        self.cache_index = 0
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]

        if self.pattern_file.closed:
            raise StopIteration

        line = self.pattern_file.readline()
        if not line:
            self.pattern_file.close()
            raise StopIteration

        pattern, search, replace = line.split(None, 3)
        funcs = build_match_and_apply_functions(
            pattern, search, replace)
        self.cache.append(funcs)
        return funcs

rules = LazyRules()
This is a class that implements both __iter__() and __next__(), which means it can be used as an iterator. Then you instantiate the class and assign it to rules. This happens just once, on import.
Let’s take the class one bite at a time.
class LazyRules:
    rules_filename = 'plural6-rules.txt'

    def __init__(self):
        self.pattern_file = open(self.rules_filename, encoding='utf-8')  ①
        self.cache = []                                                  ②
1. When we instantiate the LazyRules class, open the pattern file but don’t read anything from it. (That comes
later.)
2. After opening the patterns file, initialize the cache. You’ll use this cache later (in the __next__() method) as
you read lines from the pattern file.
Before we continue, let’s take a closer look at rules_filename. It’s not defined within the __init__() method; in fact, it’s not defined within any method. It’s defined at the class level. It’s a class variable, and although you can access it just like an instance variable (self.rules_filename), it is shared across all
instances of the LazyRules class.
>>> import plural6
>>> r1 = plural6.LazyRules()
>>> r2 = plural6.LazyRules()
>>> r1.rules_filename
①
'plural6-rules.txt'
>>> r2.rules_filename
'plural6-rules.txt'
>>> r2.rules_filename = 'r2-override.txt'
②
>>> r2.rules_filename
'r2-override.txt'
>>> r1.rules_filename
'plural6-rules.txt'
>>> r2.__class__.rules_filename
③
'plural6-rules.txt'
>>> r2.__class__.rules_filename = 'papayawhip.txt'
④
>>> r1.rules_filename
'papayawhip.txt'
>>> r2.rules_filename
⑤
'r2-override.txt'
1. Each instance of the class inherits the rules_filename attribute with the value defined by the class.
2. Changing the attribute’s value in one instance does not affect other instances…
3. …nor does it change the class attribute. You can access the class attribute (as opposed to an individual
instance’s attribute) by using the special __class__ attribute to access the class itself.
4. If you change the class attribute, all instances that are still inheriting that value (like r1 here) will be affected.
5. Instances that have overridden that attribute (like r2 here) will not be affected.
And now back to our show.
def __iter__(self):       ①
    self.cache_index = 0
    return self           ②
1. The __iter__() method will be called every time someone — say, a for loop — calls iter(rules).
2. The one thing that every __iter__() method must do is return an iterator. In this case, it returns self,
which signals that this class defines a __next__() method which will take care of returning values
throughout the iteration.
def __next__(self):                            ①
    .
    .
    .
    pattern, search, replace = line.split(None, 3)
    funcs = build_match_and_apply_functions(   ②
        pattern, search, replace)
    self.cache.append(funcs)                   ③
    return funcs
1. The __next__() method gets called whenever someone — say, a for loop — calls next(rules). This
method will only make sense if we start at the end and work backwards. So let’s do that.
2. The last part of this function should look familiar, at least. The build_match_and_apply_functions()
function hasn’t changed; it’s the same as it ever was.
3. The only difference is that, before returning the match and apply functions (which are stored in the tuple
funcs), we’re going to save them in self.cache.
Moving backwards…
def __next__(self):
    .
    .
    .
    line = self.pattern_file.readline()  ①
    if not line:                         ②
        self.pattern_file.close()
        raise StopIteration              ③
    .
    .
    .
1. A bit of advanced file trickery here. The readline() method (note: singular, not the plural readlines())
reads exactly one line from an open file. Specifically, the next line. (File objects are iterators too! It’s iterators all
the way down…)
2. If there was a line for readline() to read, line will not be an empty string. Even if the file contained a
blank line, line would end up as the one-character string '\n' (a newline character). If line is really an empty
string, that means there are no more lines to read from the file.
3. When we reach the end of the file, we should close the file and raise the magic StopIteration exception.
Remember, we got to this point because we needed a match and apply function for the next rule. The next
rule comes from the next line of the file… but there is no next line! Therefore, we have no value to return.
The iteration is over. (♫ The party’s over… ♫)
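The distinction between a blank line ('\n') and the end of the file ('') is easy to see in the interactive shell. Here io.StringIO stands in for a real file, purely for illustration:

```python
import io

f = io.StringIO('first line\n\nlast line')

print(repr(f.readline()))   # the first line, trailing newline included
print(repr(f.readline()))   # '\n': a blank line, not the end of the file
print(repr(f.readline()))   # the last line (no trailing newline here)
print(repr(f.readline()))   # '': now the file really is exhausted
```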
Moving backwards all the way to the start of the __next__() method…
def __next__(self):
    self.cache_index += 1
    if len(self.cache) >= self.cache_index:
        return self.cache[self.cache_index - 1]  ①
    if self.pattern_file.closed:
        raise StopIteration                      ②
    .
    .
    .
1. self.cache will be a list of the functions we need to match and apply individual rules. (At least that should
sound familiar!) self.cache_index keeps track of which cached item we should return next. If we haven’t
exhausted the cache yet (i.e. if the length of self.cache is at least self.cache_index, which was just
incremented), then we have a cache hit! Hooray! We can return the match and apply functions from the
cache instead of building them from scratch.
2. On the other hand, if we don’t get a hit from the cache, and the file object has been closed (which could
happen, further down the method, as you saw in the previous code snippet), then there’s nothing more we
can do. If the file is closed, it means we’ve exhausted it — we’ve already read through every line from the
pattern file, and we’ve already built and cached the match and apply functions for each pattern. The file is
exhausted; the cache is exhausted; I’m exhausted. Wait, what? Hang in there, we’re almost done.
Putting it all together, here’s what happens:
• When the module is imported, it creates a single instance of the LazyRules class, called rules, which opens
the pattern file but does not read from it.
• When asked for the first match and apply function, it checks its cache but finds the cache is empty. So it
reads a single line from the pattern file, builds the match and apply functions from those patterns, and caches
them.
• Let’s say, for the sake of argument, that the very first rule matched. If so, no further match and apply
functions are built, and no further lines are read from the pattern file.
• Furthermore, for the sake of argument, suppose that the caller calls the plural() function again to pluralize
a different word. The for loop in the plural() function will call iter(rules), which will reset the cache
index but will not reset the open file object.
• The first time through, the for loop will ask for a value from rules, which will invoke its __next__()
method. This time, however, the cache is primed with a single pair of match and apply functions,
corresponding to the patterns in the first line of the pattern file. Since they were built and cached in the
course of pluralizing the previous word, they’re retrieved from the cache. The cache index increments, and
the open file is never touched.
• Let’s say, for the sake of argument, that the first rule does not match this time around. So the for loop
comes around again and asks for another value from rules. This invokes the __next__() method a second
time. This time, the cache is exhausted — it only contained one item, and we’re asking for a second — so
the __next__() method continues. It reads another line from the open file, builds match and apply functions
out of the patterns, and caches them.
• This read-build-and-cache process will continue as long as the rules being read from the pattern file don’t
match the word we’re trying to pluralize. If we do find a matching rule before the end of the file, we simply
use it and stop, with the file still open. The file pointer will stay wherever we stopped reading, waiting for
the next readline() command. In the meantime, the cache now has more items in it, and if we start all
over again trying to pluralize a new word, each of those items in the cache will be tried before reading the
next line from the pattern file.
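The read-build-and-cache dance is easier to see in isolation. Here is a stripped-down sketch of the same pattern (a hypothetical LazyList class, not from the book) that caches items from an underlying iterator instead of building functions from lines of a file:

```python
class LazyList:
    '''Sketch of the LazyRules caching pattern over an in-memory source.'''

    def __init__(self, items):
        self.source = iter(items)   # stands in for the open pattern file
        self.exhausted = False      # stands in for pattern_file.closed
        self.cache = []

    def __iter__(self):
        self.cache_index = 0        # re-iterating resets only the cache index
        return self

    def __next__(self):
        self.cache_index += 1
        if len(self.cache) >= self.cache_index:
            return self.cache[self.cache_index - 1]  # cache hit
        if self.exhausted:
            raise StopIteration                      # nothing left anywhere
        try:
            item = next(self.source)                 # "read the next line"
        except StopIteration:
            self.exhausted = True                    # "close the file"
            raise
        self.cache.append(item)                      # build and cache
        return item

lazy = LazyList(['rule1', 'rule2', 'rule3'])
first_pass = [next(iter(lazy))]   # stop early: only one item gets cached
second_pass = list(lazy)          # cache hit for 'rule1', then reads onward
```

On the second pass, 'rule1' comes from the cache; the underlying source is only consulted for the items that were never reached before.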
We have achieved pluralization nirvana.
1. Minimal startup cost. The only thing that happens on import is instantiating a single class and opening a
file (but not reading from it).
2. Maximum performance. The previous example would read through the file and build functions
dynamically every time you wanted to pluralize a word. This version will cache functions as soon as they’re
built, and in the worst case, it will only read through the pattern file once, no matter how many words you
pluralize.
3. Separation of code and data. All the patterns are stored in a separate file. Code is code, and data is
data, and never the twain shall meet.
☞ Is this really nirvana? Well, yes and no. Here’s something to consider with the
LazyRules example: the pattern file is opened (during __init__()), and it remains
open until the final rule is reached. Python will eventually close the file when it exits,
or after the last instance of the LazyRules class is destroyed, but still, that could
be a long time. If this class is part of a long-running Python process, the Python
interpreter may never exit, and the LazyRules object may never get destroyed.
There are ways around this. Instead of opening the file during __init__() and
leaving it open while you read rules one line at a time, you could open the file, read
all the rules, and immediately close the file. Or you could open the file, read one
rule, save the file position with the tell() method, close the file, and later re-open
it and use the seek() method to continue reading where you left off. Or you could
not worry about it and just leave the file open, like this example code does.
Programming is design, and design is all about trade-offs and constraints. Leaving a file
open too long might be a problem; making your code more complicated might be a
problem. Which one is the bigger problem depends on your development team, your
application, and your runtime environment.
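The tell()/seek() alternative mentioned above could be sketched like this (a hypothetical helper, not the book’s code): each call opens the file, resumes at a saved position, reads one line, and returns the new position along with the line.

```python
def read_next_line(filename, position=0):
    '''Read one line starting at byte offset `position`.
    Returns (line, new_position); line is '' at end of file.
    Sketch of the open/seek/readline/tell/close strategy.'''
    with open(filename, encoding='utf-8') as f:
        f.seek(position)            # resume where we left off
        line = f.readline()         # read exactly one line
        return line, f.tell()       # remember where to resume next time
```

The file is closed between calls (the with block sees to that), at the cost of reopening and seeking on every rule. That trade-off is exactly the design question the paragraph above is raising.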
⁂
7.7. FURTHER READING
• Generator Tricks for Systems Programmers
CHAPTER 8. ADVANCED ITERATORS
❝ Great fleas have little fleas upon their backs to bite ’em,
And little fleas have lesser fleas, and so ad infinitum. ❞
— Augustus De Morgan
8.1. DIVING IN
Just as regular expressions put strings on steroids, the itertools module puts iterators on steroids. But first, I want to show you a classic puzzle.
HAWAII + IDAHO + IOWA + OHIO == STATES
510199 + 98153 + 9301 + 3593 == 621246
H = 5
A = 1
W = 0
I = 9
D = 8
O = 3
S = 6
T = 2
E = 4
Puzzles like this are called cryptarithms or alphametics. The letters spell out actual words, but if you replace each letter with a digit from 0–9, it also “spells” an arithmetic equation. The trick is to figure out which
letter maps to each digit. All the occurrences of each letter must map to the same digit, no digit can be
repeated, and no “word” can start with the digit 0.
In this chapter, we’ll dive into an incredible Python program originally written by Raymond Hettinger. This program solves alphametic puzzles in just 14 lines of code.

(Margin note: the most well-known alphametic puzzle is SEND + MORE = MONEY.)
import re
import itertools

def solve(puzzle):
    words = re.findall('[A-Z]+', puzzle.upper())
    unique_characters = set(''.join(words))
    assert len(unique_characters) <= 10, 'Too many letters'
    first_letters = {word[0] for word in words}
    n = len(first_letters)
    sorted_characters = ''.join(first_letters) + \
        ''.join(unique_characters - first_letters)
    characters = tuple(ord(c) for c in sorted_characters)
    digits = tuple(ord(c) for c in '0123456789')
    zero = digits[0]
    for guess in itertools.permutations(digits, len(characters)):
        if zero not in guess[:n]:
            equation = puzzle.translate(dict(zip(characters, guess)))
            if eval(equation):
                return equation

if __name__ == '__main__':
    import sys
    for puzzle in sys.argv[1:]:
        print(puzzle)
        solution = solve(puzzle)
        if solution:
            print(solution)
You can run the program from the command line. On Linux, it would look like this. (These may take some
time, depending on the speed of your computer, and there is no progress bar. Just be patient!)
you@localhost:~/diveintopython3/examples$ python3 alphametics.py "HAWAII + IDAHO + IOWA + OHIO == STATES"
HAWAII + IDAHO + IOWA + OHIO == STATES
510199 + 98153 + 9301 + 3593 == 621246
you@localhost:~/diveintopython3/examples$ python3 alphametics.py "I + LOVE + YOU == DORA"
I + LOVE + YOU == DORA
1 + 2784 + 975 == 3760
you@localhost:~/diveintopython3/examples$ python3 alphametics.py "SEND + MORE == MONEY"
SEND + MORE == MONEY
9567 + 1085 == 10652
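The heart of the solver is the puzzle.translate(...) call. str.translate() takes a mapping of character ordinals to character ordinals, which is exactly what dict(zip(characters, guess)) builds. A standalone sketch, using the known SEND + MORE == MONEY solution from the output above:

```python
puzzle = 'SEND + MORE == MONEY'

# Map each letter's ordinal to its digit's ordinal: S=9, E=5, N=6, D=7,
# M=1, O=0, R=8, Y=2 (the solution shown in the shell session above).
characters = tuple(ord(c) for c in 'SENDMORY')
guess = tuple(ord(c) for c in '95671082')

equation = puzzle.translate(dict(zip(characters, guess)))
print(equation)        # 9567 + 1085 == 10652
print(eval(equation))  # True
```

Every candidate guess is translated into an arithmetic string this way, and eval() decides whether the equation actually holds.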
⁂
8.2. FINDING ALL OCCURRENCES OF A PATTERN
The first thing this alphametics solver does is find all the letters (A–Z) in the puzzle.
>>> import re
>>> re.findall('[0-9]+', '16 2-by-4s in rows of 8')
①
['16', '2', '4', '8']
>>> re.findall('[A-Z]+', 'SEND + MORE == MONEY')
②
['SEND', 'MORE', 'MONEY']
1. The re module is Python’s implementation of regular expressions. It has a nifty function called findall() which takes a regular expression pattern and a string, and finds all occurrences of the pattern within the
string. In this case, the pattern matches sequences of numbers. The findall() function returns a list of all
the substrings that matched the pattern.
2. Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each
item in the list is a string that matched the regular expression pattern.
Here’s another example that will stretch your brain a little.
>>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
[' sixth s', " sheikh's s", " sheep's s"]
Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches (the matched portion of each is bracketed):

1. The[ sixth s]ick sheikh's sixth sheep's sick.
2. The sixth[ sick s]heikh's sixth sheep's sick.
3. The sixth sick[ sheikh's s]ixth sheep's sick.
4. The sixth sick sheikh's[ sixth s]heep's sick.
5. The sixth sick sheikh's sixth[ sheep's s]ick.

But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five.

(Margin note: this is the hardest tongue twister in the English language.)
This has nothing to do with the alphametics solver; I just thought it was interesting.
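As a further aside: if you ever do want the overlapping matches, one common trick (not used in this book) is to wrap the pattern in a zero-width lookahead with a capturing group. Because a lookahead consumes no characters, the scan can restart one character later and find every overlapping occurrence:

```python
import re

tongue_twister = "The sixth sick sheikh's sixth sheep's sick."

# (?=...) matches at a position without consuming any characters,
# so overlapping occurrences are all found; the group captures each one.
matches = re.findall(r"(?=( s.*? s))", tongue_twister)
print(len(matches))  # 5
print(matches[0])    # ' sixth s'
```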
⁂
8.3. FINDING THE UNIQUE ITEMS IN A SEQUENCE
Sets make it trivial to find the unique items in a sequence.
>>> a_list = ['The', 'sixth', 'sick', "sheik's", 'sixth', "sheep's", 'sick']
>>> set(a_list)
①
{'sixth', 'The', "sheep's", 'sick', "sheik's"}
>>> a_string = 'EAST IS EAST'
>>> set(a_string)
②
{'A', ' ', 'E', 'I', 'S', 'T'}
>>> words = ['SEND', 'MORE', 'MONEY']
>>> ''.join(words)
③
'SENDMOREMONEY'
>>> set(''.join(words))
④
{'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
1. Given a list of several strings, the set() function will return a set of unique strings from the list. This makes
sense if you think of it like a for loop. Take the first item from the list, put it in the set. Second. Third.
Fourth. Fifth — wait, that’s in the set already, so it only gets listed once, because Python sets don’t allow
duplicates. Sixth. Seventh — again, a duplicate, so it only gets listed once. The end result? All the unique
items in the original list, without any duplicates. The original list doesn’t even need to be sorted first.
2. The same technique works with strings, since a string is just a sequence of characters.
3. Given a list of strings, ''.join(a_list) concatenates all the strings together into one.
4. So, given a list of strings, this line of code returns all the unique characters across all the strings, with no
duplicates.
The alphametics solver uses this technique to build a set of all the unique characters in the puzzle.
unique_characters = set(''.join(words))
This list is later used to assign digits to characters as the solver iterates through the possible solutions.
⁂
8.4. MAKING ASSERTIONS
Like many programming languages, Python has an assert statement. Here’s how it works.
>>> assert 1 + 1 == 2
①
>>> assert 1 + 1 == 3
②
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
>>> assert 2 + 2 == 5, "Only for very large values of 2"
③
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError: Only for very large values of 2
1. The assert statement is followed by any valid Python expression. In this case, the expression 1 + 1 == 2
evaluates to True, so the assert statement does nothing.
2. However, if the Python expression evaluates to False, the assert statement will raise an AssertionError.
3. You can also include a human-readable message that is printed if the AssertionError is raised.
Therefore, this line of code:
assert len(unique_characters) <= 10, 'Too many letters'
…is equivalent to this:
if len(unique_characters) > 10:
raise AssertionError('Too many letters')
The alphametics solver uses this exact assert statement to bail out early if the puzzle contains more than
ten unique letters. Since each letter is assigned a unique digit, and there are only ten digits, a puzzle with
more than ten unique letters can not possibly have a solution.
⁂
8.5. GENERATOR EXPRESSIONS
A generator expression is like a generator function without the function.
>>> unique_characters = {'E', 'D', 'M', 'O', 'N', 'S', 'R', 'Y'}
>>> gen = (ord(c) for c in unique_characters)
①
>>> gen
②
<generator object <genexpr> at 0x00BADC10>
>>> next(gen)
③
69
>>> next(gen)
68
>>> tuple(ord(c) for c in unique_characters)
④
(69, 68, 77, 79, 78, 83, 82, 89)
1. A generator expression is like an anonymous function that yields values. The expression itself looks like a list
comprehension, but it’s wrapped in parentheses instead of square brackets.
2. The generator expression returns… an iterator.
3. Calling next(gen) returns the next value from the iterator.
4. If you like, you can iterate through all the possible values and return a tuple, list, or set, by passing the
generator expression to tuple(), list(), or set(). In these cases, you don’t need an extra set of
parentheses — just pass the “bare” expression ord(c) for c in unique_characters to the tuple()
function, and Python figures out that it’s a generator expression.
☞ Using a generator expression instead of a list comprehension can save both CPU and
RAM. If you’re building a list just to throw it away (e.g. passing it to tuple() or
set()), use a generator expression instead!
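The RAM claim is easy to check with sys.getsizeof(), which reports an object’s in-memory size. A list comprehension materializes every element up front; the equivalent generator expression is a small fixed-size object no matter how many values it will eventually yield:

```python
import sys

squares_list = [n * n for n in range(100000)]   # builds all 100,000 ints now
squares_gen = (n * n for n in range(100000))    # builds nothing yet

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # a couple hundred bytes

# Both produce exactly the same values when consumed.
assert sum(squares_gen) == sum(squares_list)
```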
Here’s another way to accomplish the same thing, using a generator function:
def ord_map(a_string):
for c in a_string:
yield ord(c)
gen = ord_map(unique_characters)
The generator expression is more compact but functionally equivalent.
⁂
8.6. CALCULATING PERMUTATIONS… THE LAZY WAY!
First of all, what the heck are permutations? Permutations are a mathematical concept. (There are actually
several definitions, depending on what kind of math you’re doing. Here I’m talking about combinatorics, but if
that doesn’t mean anything to you, don’t worry about it. As always, Wikipedia is your friend.)
The idea is that you take a list of things (could be numbers, could be letters, could be dancing bears) and
find all the possible ways to split them up into smaller lists. All the smaller lists have the same size, which
can be as small as 1 and as large as the total number of items. Oh, and nothing can be repeated.
Mathematicians say things like “let’s find the permutations of 3 different items taken 2 at a time,” which
means you have a sequence of 3 items and you want to find all the possible ordered pairs.
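That count has a closed form: n items taken k at a time give n! / (n - k)! permutations, since there are n choices for the first slot, n - 1 for the second, and so on. A quick sanity check with itertools and math.factorial:

```python
import itertools
import math

n, k = 3, 2
perms = list(itertools.permutations(range(n), k))

# "3 different items taken 2 at a time": 3! / (3 - 2)! = 6 ordered pairs.
assert len(perms) == math.factorial(n) // math.factorial(n - k)
print(perms)  # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```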
>>> import itertools
①
>>> perms = itertools.permutations([1, 2, 3], 2)
②
>>> next(perms)
③
(1, 2)
>>> next(perms)
(1, 3)
>>> next(perms)
(2, 1)
④
>>> next(perms)
(2, 3)
>>> next(perms)
(3, 1)
>>> next(perms)
(3, 2)
>>> next(perms)
⑤
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
1. The itertools module has all kinds of fun stuff in it, including a permutations() function that does all the
hard work of finding permutations.
2. The permutations() function takes a sequence (here a list of three integers) and a number, which is the
number of items you want in each smaller group. The function returns an iterator, which you can use in a
for loop or any old place that iterates. Here I’ll step through the iterator manually to show all the values.
3. The first permutation of [1, 2, 3] taken 2 at a time is (1, 2).
4. Note that permutations are ordered: (2, 1) is different than (1, 2).
5. That’s it! Those are all the permutations of [1, 2, 3] taken 2 at a time. Pairs like (1, 1) and (2, 2)
never show up, because they contain repeats so they aren’t valid permutations. When there are no more
permutations, the iterator raises a StopIteration exception.
The permutations() function doesn’t have to take a list. It can take any sequence — even a string.
>>> import itertools
>>> perms = itertools.permutations('ABC', 3)
①
>>> next(perms)
('A', 'B', 'C')
②
>>> next(perms)
('A', 'C', 'B')
>>> next(perms)
('B', 'A', 'C')
>>> next(perms)
('B', 'C', 'A')
>>> next(perms)
('C', 'A', 'B')
>>> next(perms)
('C', 'B', 'A')
>>> next(perms)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration