Chapter 34. File Formats: CSV, Tab, XML, Logs and Others
XML Files: The xml.dom.minidom and xml.sax Modules
XML files are text files, intended for human consumption, that mix
markup with content. The markup follows a small number of relatively
simple rules. Additionally, there are structural requirements that ensure
that an XML file is at least well-formed. A Document Type Definition
(DTD) or an XML Schema Definition (XSD) can supply additional structural
rules that a valid document must satisfy.
There are three separate XML parsers available with Python. We'll
ignore the xml.parsers.expat module (not for any good
reason), and focus on the xml.sax and
xml.dom.minidom parsers.
xml.sax Parsing. The Simple API for XML (SAX) parser is described as an event
parser. The parser recognizes different elements of an XML document
and invokes methods in a handler which you provide. Your handler will
be given pieces of the document, and can do appropriate processing
with those pieces.
For most XML processing, your program will have the following
outline:
1. Define a subclass of xml.sax.ContentHandler. The methods of this
class are where your unique processing will happen.
2. Request the module to create a parser by calling
xml.sax.make_parser. The result is an xml.sax.xmlreader.XMLReader object.
3. Create an instance of your handler class. Provide this to the
parser you created. The parser will then use your
ContentHandler as it parses.
4. Set any features or options in the parser.
5. Invoke the parser on your document (or incoming stream of data
from a network socket).
Here's a short example that shows the essentials of building a
simple XML parser with the xml.sax module. This
example defines a simple ContentHandler that
prints the tags as well as counting the occurrences of the
<informaltable> tag.
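A handler along these lines can be sketched as follows. This is an illustration written for Python 3, not a definitive listing: the TagCounter class name, the tiny sample document, and the use of an in-memory byte stream in place of a real DocBook file are all assumptions made so the sketch is self-contained.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Print each tag, indented by nesting depth, and count <informaltable>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tables = 0

    def startElement(self, name, attrs):
        print(self.depth * ' ', name)
        if name == "informaltable":
            self.tables += 1
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1


# A small stand-in for a DocBook file (hypothetical content).
SAMPLE = b"""<chapter>
<section><informaltable/><para>Text</para></section>
<section><informaltable/></section>
</chapter>"""

parser = xml.sax.make_parser()       # ask the module for a parser
handler = TagCounter()               # create the handler instance
parser.setContentHandler(handler)    # provide the handler to the parser
parser.parse(io.BytesIO(SAMPLE))     # parse a file-like object
print("informaltable count:", handler.tables)
```

To read a real file, the call to parse can be given a file name instead of the byte stream.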
Since the parsing is event-driven, your handler must accumulate
any context required to determine where the individual tags occur. In
some content models (like XHTML and DocBook) there are two levels of
markup: structural and semantic. The structural markup includes books,
parts, chapters, sections, lists and the like. The semantic markup is
sometimes called "inline" markup, and it includes tags to identify
function names, class names, exception names, variable names, and the
like. When processing this kind of document, your application must
determine which tag is which.
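One common way to accumulate that context is to keep a stack of the currently open tags. The sketch below assumes a hypothetical PathTracker class and treats function, classname and varname as the inline tags of interest; a real application would choose its own set.

```python
import xml.sax


class PathTracker(xml.sax.ContentHandler):
    """Keep a stack of open tags so each inline tag can be placed in context."""

    INLINE = {"function", "classname", "varname"}  # assumed inline tags

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.found = []   # (inline tag, path of enclosing tags)

    def startElement(self, name, attrs):
        if name in self.INLINE:
            self.found.append((name, "/".join(self.stack)))
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()


SAMPLE = b"""<chapter><section><para>Call <function>open</function>
on a <classname>file</classname>.</para></section></chapter>"""

handler = PathTracker()
xml.sax.parseString(SAMPLE, handler)
for tag, path in handler.found:
    print(tag, "inside", path)
```

The stack is empty again when parsing finishes, because every start tag in a well-formed document has a matching end tag.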
A ContentHandler Subclass. The heart of a SAX parser is the subclass of
ContentHandler that you define in your
application. There are a number of methods which you may want to
override. Minimally, you'll override the
startElement and
characters methods. There are other methods
of this class described in section 13.10.1 of the Python
Library Reference.
setDocumentLocator(locator)
The parser will call this method to provide an
xml.sax.xmlreader.Locator object. This object has the
public and system identifiers of the XML document, plus line and
column information. The locator is updated by the parser as it works,
so it should only be used within these handler methods.
startDocument
The parser will call this method at the start of the
document. It can be used for initialization and resetting any
context information.
endDocument
This method is paired with the
startDocument method; it is called once
by the parser at the end of the document.
startElement(name, attrs)
The parser calls this method with each start tag that is found, in
non-namespace mode. The name is the string with
the tag name. The attrs parameter is an
xml.sax.xmlreader.AttributesImpl object. This object may be
reused by the parser, so your handler should not save it; use its
copy method to keep the attributes. The object behaves somewhat like
a mapping. It supports the [] and in operators for getting and testing
values, as well as the get, items,
keys, and values
methods (and has_key in Python 2).
endElement(name)
The parser calls this method with each end tag that is found, in
non-namespace mode. The name is the string with
the tag name.
startElementNS(name, qname, attrs)
The parser calls this method with each start tag that is found, in
namespace mode. You set namespace mode by using the parser's
p.setFeature( xml.sax.handler.feature_namespaces, True ).
The name is a tuple with the URI for
the namespace and the tag name. The qname is
the fully qualified text name; a parser may provide None here. The attrs
parameter is an xml.sax.xmlreader.AttributesNSImpl object, whose keys
are (URI, tag name) tuples.
This object may be reused by the parser, so your handler should not
save it; use its copy method to keep the attributes. The object behaves
somewhat like a mapping. It supports the [] and in operators for
getting and testing values, as well as the get, items,
keys, and values
methods (and has_key in Python 2).
endElementNS(name, qname)
The parser calls this method with each end tag that is found, in
namespace mode. The name is a tuple with the
URI for the namespace and the tag name. The
qname is the fully qualified text name.
characters(content)
The parser uses this method to provide character data to the
ContentHandler. The parser may provide
character data in a single chunk, or it may provide the characters
in several chunks.
ignorableWhitespace(whitespace)
The parser will use this method to provide ignorable
whitespace to the ContentHandler. This is
whitespace between tags, usually line breaks and indentation. The
parser may provide whitespace in a single chunk, or it may provide
the characters in several chunks.
processingInstruction(target, data)
The parser will provide all
<?target data?> processing
instructions to this method. Note that the initial <?xml
version="1.0" encoding="UTF-8"?> is not reported.
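The handler methods above can be combined as in the following sketch for Python 3. The class names DetailHandler and NamespaceHandler, the sample document, and the http://example.com/x namespace URI are assumptions for illustration. The first handler runs in non-namespace mode and shows the locator, a copy of the attributes, accumulated character data and processing instructions; the second shows the (URI, tag name) tuples delivered in namespace mode.

```python
import io
import xml.sax
import xml.sax.handler

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?render style="plain"?>
<doc xmlns:x="http://example.com/x">
<para id="p1">Some <x:code>text</x:code> here.</para>
</doc>
"""


class DetailHandler(xml.sax.ContentHandler):
    """Non-namespace mode: attributes, text, locations, instructions."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.with_id = []       # (tag, copied attributes, line number)
        self.chunks = []        # character data seen since the last <para>
        self.paras = []         # complete text of each <para>
        self.instructions = []  # (target, data) pairs

    def setDocumentLocator(self, locator):
        self.locator = locator  # query it only inside handler methods

    def startElement(self, name, attrs):
        if attrs.get("id") is not None:
            line = self.locator.getLineNumber()
            # Copy the attributes; the parser may reuse the attrs object.
            self.with_id.append((name, dict(attrs.items()), line))
        if name == "para":
            self.chunks = []

    def characters(self, content):
        self.chunks.append(content)  # text may arrive in several pieces

    def endElement(self, name):
        if name == "para":
            self.paras.append("".join(self.chunks))

    def processingInstruction(self, target, data):
        self.instructions.append((target, data))


class NamespaceHandler(xml.sax.ContentHandler):
    """Namespace mode: each name arrives as a (URI, tag name) tuple."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        self.names.append(name)


detail = DetailHandler()
xml.sax.parseString(SAMPLE, detail)

ns_handler = NamespaceHandler()
ns_parser = xml.sax.make_parser()
ns_parser.setFeature(xml.sax.handler.feature_namespaces, True)
ns_parser.setContentHandler(ns_handler)
ns_parser.parse(io.BytesIO(SAMPLE))

print(detail.with_id, detail.paras, detail.instructions)
print(ns_handler.names)
```

Joining the chunks at the end tag, rather than using each call to characters on its own, is what makes the handler independent of how the parser happens to split the text.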
xml.dom.minidom Parsing. The Document Object Model (DOM) parser creates a document object
model from your XML document. The parser transforms the text of an XML
document into a DOM object. Once your program has the DOM object, you
can examine that object.
Here's a short example that shows the essentials of building a
simple XML parser with the xml.dom.minidom module. This
example walks the DOM tree, printing the tags as well as collecting
each <informaltable> element.
We defined a walkNode function which does a
recursive, depth-first traversal of the elements in the document
structure. In many applications, the structure of the XML document is
well known, and functions which are tied to the structure of the
document can be used. In this example, we're reading a DocBook XML file,
which has a complex, highly-nested structure.
import xml.dom.minidom

tables = []

def walkNode(n, depth=0):
    # Print the tag, indented to show its nesting depth.
    print(depth * ' ', n.tagName)
    if n.tagName == "informaltable":
        tables.append(n)
    # Recurse into child elements only; skip text, comments and the like.
    for d in n.childNodes:
        if d.nodeType == xml.dom.Node.ELEMENT_NODE:
            walkNode(d, depth + 1)

dom1 = xml.dom.minidom.parse("../p5-projects.xml")
walkNode(dom1.documentElement)
print(tables)
The DOM Object Model. The heart of a DOM parser is the DOM class hierarchy. Your
program will work with an xml.dom.minidom.Document
object. We'll look at a few essential classes of the DOM. There are
other classes in this model, described in section 13.6.2 of the
Python Library Reference. We'll focus on the
most commonly-used classes.
The XML Document Object Model is a standard definition. The
standard applies to Java programs as well as Python. The
xml.dom package provides the base definitions for
this standard. The standard doesn't address how XML is parsed to create
this structure. Consequently, the xml.dom package
has no official parser. You could, for example, use a SAX parser to
produce a DOM structure. Your handler would create objects from the
classes of a DOM implementation such as xml.dom.minidom.
The xml.dom.minidom package is a slightly simplified
implementation of the DOM standard. This
implementation is extended to include a parser. The
Node base class, with its node type constants, comes from
xml.dom; the other classes are defined in xml.dom.minidom. We'll only
look at methods used to get
data from an XML document. We'll ignore the additional methods used by a
parser to build a DOM object.
class Node
The Node class is the superclass for
all of the various DOM classes. It defines a number of attributes
and methods which are common to all of the various subclasses.
This class should be thought of as abstract: it is not used
directly; it exists to provide common features to all of the
subclasses.
Here are the attributes which are common to all of the
various kinds of Nodes.
nodeType
This is an integer code that discriminates among the
subclasses of Node. There are a
number of helpful symbolic constants which are class
variables in xml.dom.Node. These constants define the
various types of Nodes: ELEMENT_NODE,
ATTRIBUTE_NODE,
TEXT_NODE,
CDATA_SECTION_NODE,
ENTITY_NODE,
PROCESSING_INSTRUCTION_NODE,
COMMENT_NODE,
DOCUMENT_NODE,
DOCUMENT_TYPE_NODE,
NOTATION_NODE.
attributes
This is a map-like collection of attributes. It is an
instance of xml.dom.minidom.NamedNodeMap. Only an Element has a
real value here; other kinds of Nodes have None. It
has method functions including get,
getNamedItem,
getNamedItemNS,
has_key (Python 2 only),
item,
items,
itemsNS,
keys,
keysNS,
removeNamedItem,
removeNamedItemNS,
setNamedItem,
setNamedItemNS,
values, plus a length attribute. The
item method and the
length attribute are defined by the
standard and provided for Java compatibility.
localName
If there is a namespace, then this is the portion of
the name after the colon. If there is no namespace, this is
the entire tag name.
prefix
If there is a namespace, then this is the portion of
the name before the colon. If there is no namespace, this is
an empty string.
namespaceURI
If there is a namespace, this is the URI for that
namespace. If there is no namespace, this is
None.
parentNode
This is the parent of this
Node. The
Document
will have None for this attribute, since
it is the root that contains all other Nodes in the
document. For all other Nodes, this
is the context in which the Node
appears.
previousSibling
Sibling Nodes share a common
parent. This attribute of a Node is
the Node which precedes it within a
parent. If this is the first Node
under a parent, the previousSibling will
be None. Often, the preceding
Node will be a
Text containing whitespace.
nextSibling
Sibling Nodes share a common
parent. This attribute of a Node is
the Node which follows it within a
parent. If this is the last Node
under a parent, the nextSibling will be
None. Often, the following
Node will be
Text containing whitespace.
childNodes
The list of child Nodes under this Node. Generally,
this will be an xml.dom.minidom.NodeList
instance, not a simple Python list. A
NodeList behaves like a
list, but has two extras: the
item method and the
length attribute, which are defined by the
standard and provided for Java compatibility.
firstChild
The first Node in the
childNodes list, similar to
childNodes[0]. It will be None if the
childNodes list is empty.
lastChild
The last Node in the
childNodes list, similar to
childNodes[-1]. It will be None if the
childNodes list is empty.
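The attributes above can be explored with a short sketch such as the following, written for Python 3. The sample document and the variable names are assumptions for illustration. Note how the whitespace between tags shows up as Text nodes among the children.

```python
import xml.dom.minidom
from xml.dom import Node

doc = xml.dom.minidom.parseString(
    '<list kind="simple">\n  <item>one</item>\n  <item>two</item>\n</list>'
)
root = doc.documentElement

# childNodes mixes Element nodes with whitespace-only Text nodes.
kinds = [child.nodeType for child in root.childNodes]
print(kinds)

first_item = root.childNodes[1]
print(first_item.parentNode is root)                  # parent is <list>
print(first_item.previousSibling is root.firstChild)  # whitespace Text
print(doc.parentNode)                                 # the Document has none

# The attributes collection behaves much like a mapping.
print(root.attributes.items())
print(root.attributes.length, root.attributes.item(0).name)

# Text nodes have no attribute collection at all.
print(root.firstChild.attributes)
```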
Here are some attributes which are overridden in each
subclass of Node. They have slightly
different meanings for each node type.
nodeName
A string with the "name" for this
Node. For an
Element, this will be the same as the
tagName attribute. In some cases, it will
be None.
nodeValue
A string with the "value" for this
Node. For a
Text, this will be the same as the
data attribute. In some cases, it will be
None.
Here are some methods of a
Node.
hasAttributes
This function returns True if there
are attributes associated with this
Node.
hasChildNodes
This function returns True if there are child
Nodes associated with this
Node.
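How nodeName and nodeValue vary by node type is easiest to see by printing them side by side. The following Python 3 sketch uses an assumed one-line sample document; hasAttributes is called only on the Element here, since that is where attributes can occur.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<para id="sample">Text<!-- note --></para>')
para = doc.documentElement
text, comment = para.childNodes
attr = para.getAttributeNode("id")

# Show the class, name and value of each kind of node.
for node in (doc, para, attr, text, comment):
    print(type(node).__name__, repr(node.nodeName), repr(node.nodeValue))

print(para.hasAttributes(), para.hasChildNodes(), text.hasChildNodes())
```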
class Document( Node )
This is the top-level document, the object returned by the
parser. It is a subclass of Node, so it
inherits all of those attributes and methods. The
Document class adds some attributes and
method functions to the Node
definition.
documentElement
This attribute refers to the top-most
Element in the XML document. A
Document may contain
DocumentType,
ProcessingInstruction and
Comment Nodes,
also. This attribute saves you having to dig through the
childNodes list for the top
Element.
getElementsByTagName(tagName)
This function returns a
NodeList with each
Element in this
Document that has the given tag
name.
getElementsByTagNameNS(namespaceURI, tagName)
This function returns a
NodeList with each
Element in this
Document that has the given namespace
URI and local tag name.
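As a minimal sketch of these Document features in Python 3, the example below parses an assumed sample document that starts with a comment and uses a hypothetical http://example.com/x namespace.

```python
import xml.dom.minidom
from xml.dom import Node

SOURCE = """<!-- leading comment -->
<book xmlns:x="http://example.com/x">
  <chapter><title>One</title></chapter>
  <chapter><title>Two</title><x:note>aside</x:note></chapter>
</book>"""

doc = xml.dom.minidom.parseString(SOURCE)

# The Document's first child is the comment, not the top Element.
print(doc.firstChild.nodeType == Node.COMMENT_NODE)
print(doc.documentElement.tagName)

# Search the whole document by tag name.
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]
print(titles)

# Search by namespace URI and local name, whatever prefix was used.
notes = doc.getElementsByTagNameNS("http://example.com/x", "note")
print([(n.tagName, n.prefix, n.localName) for n in notes])
```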
class Element( Node )
This is a specific element within an XML document. An
element is surrounded by XML tags. In <para
id="sample">Text</para>, the tag is
<para>, which provides the name for the
Element. Most
Elements will have children; some will have
Attributes as well as children. The
Element class adds some attributes and
method functions to the Node
definition.
tagName
The full name for the tag. If there is a namespace,
this will be the complete name, including colons. This will
also be in nodeName.
getElementsByTagName(tagName)
This function returns a
NodeList with each
Element in this
Element that has the given tag
name.
getElementsByTagNameNS(namespaceURI, tagName)
This function returns a
NodeList with each
Element in this
Element that has the given namespace
URI and local tag name.
hasAttribute(name)
Returns True if this
Element has an
Attr with the given name.
hasAttributeNS(namespaceURI, localName)
Returns True if this
Element has an
Attr with the given name based on the
namespace and localName.
getAttribute(name)
Returns the string value of the
Attr with the given name. If the
attribute doesn't exist, this will return a zero-length
string.
getAttributeNS(namespaceURI, localName)
Returns the string value of the
Attr with the given name. If the
attribute doesn't exist, this will return a zero-length
string.
getAttributeNode(name)
Returns the Attr with the given
name. If the named attribute doesn't exist, this method
returns None.
getAttributeNodeNS(namespaceURI, localName)
Returns the Attr with the given
name. If the named attribute doesn't exist, this method
returns None.
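Because getAttribute returns a zero-length string both for a missing attribute and for one that is present but empty, it can help to check hasAttribute first. The sketch below, for Python 3, assumes a hypothetical helper named attribute_or_default and a sample element with an empty role attribute.

```python
import xml.dom.minidom


def attribute_or_default(element, name, default=None):
    """Return the attribute's value, or default when it is absent."""
    if element.hasAttribute(name):
        return element.getAttribute(name)
    return default


doc = xml.dom.minidom.parseString('<para id="sample" role="">Text</para>')
para = doc.documentElement

print(repr(para.getAttribute("id")))        # present
print(repr(para.getAttribute("role")))      # present but empty
print(repr(para.getAttribute("missing")))   # absent, also empty
print(para.getAttributeNode("missing"))     # absent, so no Attr node
print(attribute_or_default(para, "missing", "n/a"))
```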
class Attr( Node )
This is an attribute, within an Element. In <para
id="sample">Text</para>, the tag is
<para>; this tag has an attribute of
id with a value of sample. Generally,
the nodeType, nodeName and
nodeValue attributes are all that are used. The
Attr class adds some attributes to the
Node definition.
name
The full name of the attribute, which may include
colons. The Node class defines
localName, prefix and
namespaceURI which may be necessary for
correctly processing this attribute.
value
The string value of the attribute. Also note that
nodeValue will have a copy of the
attribute's value.
class Text( Node ) and class CDATASection( Node )
This is the text within an element. In <para
id="sample">Text</para>, the text is
Text. Note that end of line characters and
indentation also count as Text nodes.
Further, the parser may break up a large piece of text into a
number of smaller Text nodes. The
Text class adds an attribute to the
Node definition.
data
The text. Also note that nodeValue
will have a copy of the text.
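Since the text of one element may be spread over several Text and CDATASection nodes, and over nested inline elements, a small recursive helper is often used to gather it. The following Python 3 sketch assumes a hypothetical text_of function and sample markup.

```python
import xml.dom.minidom
from xml.dom import Node


def text_of(element):
    """Concatenate the data of every Text and CDATA node below element."""
    parts = []
    for child in element.childNodes:
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


doc = xml.dom.minidom.parseString(
    "<para>Use <function>open</function> for <![CDATA[<raw> text]]>.</para>"
)
print(text_of(doc.documentElement))
```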
class Comment( Node )
This is the text within a comment. The <!--
and --> characters are not included. The
Comment class adds an attribute to the
Node definition.
data
The comment. Also note that
nodeValue will have a copy of the
comment.
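Comments can be collected with the same kind of recursive walk used for elements. This Python 3 sketch assumes a hypothetical comments generator and a two-comment sample document.

```python
import xml.dom.minidom
from xml.dom import Node


def comments(node):
    """Yield the text of every Comment at or below node, in document order."""
    if node.nodeType == Node.COMMENT_NODE:
        yield node.data
    for child in node.childNodes:
        yield from comments(child)


doc = xml.dom.minidom.parseString("<a><!-- first --><b><!--second--></b></a>")
found = list(comments(doc))
print(found)
```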