预计阅读本页时间:-
XML Parsing Tools
XML is a tag-based language for defining structured information, commonly used to define documents and data shipped over the Web. Although some information can be extracted from XML text with basic string methods or the re pattern module, XML’s nesting of constructs and arbitrary attribute text tend to make full parsing more accurate.
Because XML is such a pervasive format, Python itself comes with an entire package of XML parsing tools that support the SAX and DOM parsing models, as well as a package known as ElementTree—a Python-specific API for parsing and constructing XML. Beyond basic parsing, the open source domain provides support for additional XML tools, such as XPath, Xquery, XSLT, and more.
广告:个人专属 VPN,独立 IP,无限流量,多机房切换,还可以屏蔽广告和恶意软件,每月最低仅 5 美元
XML by definition represents text in Unicode form, to support internationalization. Although most of Python’s XML parsing tools have always returned Unicode strings, in Python 3.0 their results have mutated from the 2.X unicode type to the 3.0 general str string type—which makes sense, given that 3.0’s str string is Unicode, whether the encoding is ASCII or other.
We can’t go into many details here, but to sample the flavor of this domain, suppose we have a simple XML text file, mybooks.xml:
<books>
<date>2009</date>
<title>Learning Python</title>
<title>Programming Python</title>
<title>Python Pocket Reference</title>
<publisher>O'Reilly Media</publisher>
</books>
and we want to run a script to extract and display the content of all the nested title tags, as follows:
Learning Python
Programming Python
Python Pocket Reference
There are at least four basic ways to accomplish this (not counting more advanced tools like XPath). First, we could run basic pattern matching on the file’s text, though this tends to be inaccurate if the text is unpredictable. Where applicable, the re module we met earlier does the job—its match method looks for a match at the start of a string, search scans ahead for a match, and the findall method used here locates all places where the pattern matches in the string (the result comes back as a list of matched substrings corresponding to parenthesized pattern groups, or tuples of such for multiple groups):
# File patternparse.py
import re
text = open('mybooks.xml').read()
found = re.findall('<title>(.*)</title>', text)
for title in found: print(title)
Second, to be more robust, we could perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values; the interface is a formal specification, independent of Python:
# File domparse.py
from xml.dom.minidom import parse, Node
xmltree = parse('mybooks.xml')
for node1 in xmltree.getElementsByTagName('title'):
for node2 in node1.childNodes:
if node2.nodeType == Node.TEXT_NODE:
print(node2.data)
As a third option, Python’s standard library supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses and use state information to keep track of where they are in the document and collect its data:
# File saxparse.py
import xml.sax.handler
class BookHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self.inTitle = False
def startElement(self, name, attributes):
if name == 'title':
self.inTitle = True
def characters(self, data):
if self.inTitle:
print(data)
def endElement(self, name):
if name == 'title':
self.inTitle = False
import xml.sax
parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('mybooks.xml')
Finally, the ElementTree system available in the etree package of the standard library can often achieve the same effects as XML DOM parsers, but with less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document:
# File etreeparse.py
from xml.etree.ElementTree import parse
tree = parse('mybooks.xml')
for E in tree.findall('title'):
print(E.text)
When run in either 2.6 or 3.0, all four of these scripts display the same printed result:
C:\misc> c:\python26\python domparse.py
Learning Python
Programming Python
Python Pocket Reference
C:\misc> c:\python30\python domparse.py
Learning Python
Programming Python
Python Pocket Reference
Technically, though, in 2.6 some of these scripts produce unicode string objects, while in 3.0 all produce str strings, since that type includes Unicode text (whether ASCII or other):
C:\misc> c:\python30\python
>>> from xml.dom.minidom import parse, Node
>>> xmltree = parse('mybooks.xml')
>>> for node in xmltree.getElementsByTagName('title'):
... for node2 in node.childNodes:
... if node2.nodeType == Node.TEXT_NODE:
... node2.data
...
'Learning Python'
'Programming Python'
'Python Pocket Reference'
C:\misc> c:\python26\python
>>> ...same code...
...
u'Learning Python'
u'Programming Python'
u'Python Pocket Reference'
Programs that must deal with XML parsing results in nontrivial ways will need to account for the different object type in 3.0. Again, though, because all strings have nearly identical interfaces in both 2.6 and 3.0, most scripts won’t be affected by the change; tools available on unicode in 2.6 are generally available on str in 3.0.
Regrettably, going into further XML parsing details is beyond this book’s scope. If you are interested in text or XML parsing, it is covered in more detail in the applications-focused follow-up book Programming Python. For more details on re, struct, pickle, and XML tools in general, consult the Web, the aforementioned book and others, and Python’s standard library manual.