XML Files: The xml.dom.minidom and xml.sax Modules

XML files are text files, intended for human consumption, that mix markup with content. The markup follows a small number of relatively simple rules. Beyond those rules, there are structural requirements that assure that an XML file is well-formed, which is the minimal level of validity. A Document Type Definition (DTD) or an XML Schema Definition (XSD) can impose additional structural rules.

There are three separate XML parsers available with Python. We'll ignore the xml.parsers.expat module (not for any good reason), and focus on the xml.sax and xml.dom.minidom parsers.

xml.sax Parsing. The Simple API for XML (SAX) parser is described as an event parser. The parser recognizes different elements of an XML document and invokes methods in a handler which you provide. Your handler will be given pieces of the document, and can do appropriate processing with those pieces.

For most XML processing, your program will have the following outline. The parser created in these steps will use your ContentHandler as it parses.

  1. Define a subclass of xml.sax.ContentHandler. The methods of this class are where your unique processing will happen.

  2. Request the module to create a parser instance, using xml.sax.make_parser.

  3. Create an instance of your handler class. Provide this to the parser you created.

  4. Set any features or options in the parser.

  5. Invoke the parser on your document (or incoming stream of data from a network socket).

Here's a short example that shows the essentials of building a simple XML parser with the xml.sax module. This example defines a simple ContentHandler that prints the tags and counts the occurrences of the <informaltable> tag.

import xml.sax

class DumpDetails( xml.sax.ContentHandler ):
    def __init__( self ):
        xml.sax.ContentHandler.__init__( self )
        self.depth= 0
        self.tableCount= 0
    def startElement( self, aName, someAttrs ):
        # indent each tag name by its nesting depth
        print( self.depth*' ' + aName )
        self.depth += 1
        if aName == 'informaltable':
            self.tableCount += 1
    def endElement( self, aName ):
        self.depth -= 1
    def characters( self, content ):
        pass # ignore the actual data

p= xml.sax.make_parser()
myHandler= DumpDetails()
p.setContentHandler( myHandler )
p.parse( "../p5-projects.xml" )
print( myHandler.tableCount, "tables" )

Since the parsing is event-driven, your handler must accumulate any context required to determine where the individual tags occur. In some content models (like XHTML and DocBook) there are two levels of markup: structural and semantic. The structural markup includes books, parts, chapters, sections, lists and the like. The semantic markup is sometimes called "inline" markup, and it includes tags to identify function names, class names, exception names, variable names, and the like. When processing this kind of document, your application must determine which tag is which.
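One common way to accumulate that context is to keep a stack of the currently open tags. The following is a minimal sketch, not the only approach; the PathTracker class and the SAMPLE document are made up for illustration.

```python
import xml.sax

class PathTracker(xml.sax.ContentHandler):
    """Keep a stack of open tags so each element knows its ancestors."""
    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements that are currently open
        self.paths = []   # slash-separated path recorded for every element

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()

# Hypothetical sample document, used only for illustration.
SAMPLE = b"<book><chapter><para>See <function>len</function></para></chapter></book>"

tracker = PathTracker()
xml.sax.parseString(SAMPLE, tracker)
for path in tracker.paths:
    print(path)
```

With the stack available, a handler can tell an inline tag inside a para from the same tag inside a title by looking at self.stack[-2].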

A ContentHandler Subclass. The heart of a SAX parser is the subclass of ContentHandler that you define in your application. There are a number of methods which you may want to override. Minimally, you'll override the startElement and characters methods. There are other methods of this class described in section 13.10.1 of the Python Library Reference.

setDocumentLocator( locator )

The parser will call this method to provide an xml.sax.xmlreader.Locator object. This object has the XML document's public and system ID information, plus line and column information. The locator is updated by the parser as it works, so its values should only be read within these handler methods.
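As a small sketch of how the locator might be used, the handler below saves the locator and reads the line number at the moment each start tag is reported. The LineReporter class and SAMPLE text are illustrative assumptions; the exact positions reported can depend on the underlying parser.

```python
import xml.sax

class LineReporter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.locator = None
        self.positions = []   # (tag name, line number) pairs

    def setDocumentLocator(self, locator):
        # Keep a reference; the parser updates this object as it reads.
        self.locator = locator

    def startElement(self, name, attrs):
        # Read the position immediately, while it still describes this event.
        self.positions.append((name, self.locator.getLineNumber()))

# Hypothetical sample document, used only for illustration.
SAMPLE = b"<doc>\n  <item/>\n  <item/>\n</doc>"

reporter = LineReporter()
xml.sax.parseString(SAMPLE, reporter)
for name, line in reporter.positions:
    print(name, "at line", line)
```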

startDocument

The parser will call this method at the start of the document. It can be used for initialization and resetting any context information.

endDocument

This method is paired with the startDocument method; it is called once by the parser at the end of the document.

startElement( name , attrs )

The parser calls this method with each opening tag that is found, in non-namespace mode. The name is the string with the tag name. The attrs parameter is an xml.sax.xmlreader.Attributes object. This object may be reused by the parser; your handler should not save this object, although it can save a copy made with the object's copy method. The Attributes object behaves somewhat like a mapping: it supports the get, items, keys, and values methods, plus the in operator (has_key in older Python 2 releases).
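Here is a brief sketch of reading attributes inside startElement. Rather than keeping the attrs object, the handler copies what it needs into an ordinary dictionary. The AttrCollector class and the SAMPLE document are hypothetical.

```python
import xml.sax

class AttrCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # dict(attrs.items()) makes an independent copy of the attribute values;
        # get() supplies a default when the attribute is missing.
        self.seen.append((name, dict(attrs.items()), attrs.get("id", "(none)")))

# Hypothetical sample document, used only for illustration.
SAMPLE = b'<doc><para id="p1" role="intro">Hi</para><para>Bye</para></doc>'

collector = AttrCollector()
xml.sax.parseString(SAMPLE, collector)
for name, attributes, ident in collector.seen:
    print(name, attributes, ident)
```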

endElement( name )

The parser calls this method with each closing tag that is found, in non-namespace mode. The name is the string with the tag name.

startElementNS( name , qname , attrs )

The parser calls this method with each opening tag that is found, in namespace mode. You set namespace mode by using the parser's p.setFeature( xml.sax.handler.feature_namespaces, True ). The name is a tuple with the URI for the namespace and the local tag name; the URI is None when the tag has no namespace. The qname is the qualified name as it appears in the document; depending on the parser and its features, this may be None. The attrs parameter is an xml.sax.xmlreader.AttributesNS object. This object may be reused by the parser; your handler should not save this object. Like the Attributes object, it behaves somewhat like a mapping and supports the get, items, keys, and values methods; its keys are (URI, local name) tuples.

endElementNS( name , qname )

The parser calls this method with each closing tag that is found, in namespace mode. The name is a tuple with the URI for the namespace and the local tag name. The qname is the qualified name as it appears in the document, and may be None.
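The following sketch shows namespace mode in action. It assumes the default parser returned by make_parser; the NSHandler class, the SAMPLE document, and the example.com URI are invented for illustration.

```python
import io
import xml.sax
import xml.sax.handler

class NSHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) tuple; qname may be None.
        uri, localname = name
        self.names.append((uri, localname))

# Hypothetical sample document, used only for illustration.
SAMPLE = b'<d:doc xmlns:d="http://example.com/d"><d:item/><plain/></d:doc>'

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
handler = NSHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))
for uri, localname in handler.names:
    print(uri, localname)
```

Note that in namespace mode the parser calls startElementNS and endElementNS instead of startElement and endElement, so a handler that must work in both modes needs both pairs of methods.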

characters( content )

The parser uses this method to provide character data to the ContentHandler. The parser may provide character data in a single chunk, or it may provide the characters in several chunks.
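Because the text of one element may arrive in several chunks, a handler usually collects the chunks and joins them when the closing tag arrives. This is a minimal sketch of that idea; the TextCollector class and SAMPLE document are assumptions made for the example.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of text for the element being read
        self.titles = []   # completed title strings

    def startElement(self, name, attrs):
        if name == "title":
            self.chunks = []          # start a fresh buffer for this title

    def characters(self, content):
        self.chunks.append(content)   # may be called several times per element

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.chunks))

# Hypothetical sample document, used only for illustration.
SAMPLE = b"<doc><title>Tom &amp; Jerry</title><title>Second</title></doc>"

collector = TextCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.titles)
```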

ignorableWhitespace( whitespace )

The parser will use this method to provide ignorable whitespace to the ContentHandler. This is whitespace between tags, usually line breaks and indentation. The parser may provide whitespace in a single chunk, or it may provide the characters in several chunks.

processingInstruction( target , data )

The parser will provide all <? target data ?> processing instructions to this method. Note that the initial <?xml version="1.0" encoding="UTF-8"?> is not reported.
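As a short sketch, this handler records each processing instruction it is given. The PICollector class and the SAMPLE document are hypothetical; note that the XML declaration at the start does not show up in the results.

```python
import xml.sax

class PICollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.found = []

    def processingInstruction(self, target, data):
        # target is the name right after "<?"; data is the rest of the text.
        self.found.append((target, data))

# Hypothetical sample document, used only for illustration.
SAMPLE = b'<?xml version="1.0"?><?xml-stylesheet href="style.css"?><doc/>'

collector = PICollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.found)
```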

xml.dom.minidom Parsing. The Document Object Model (DOM) parser creates a document object model from your XML document. The parser transforms the text of an XML document into a DOM object. Once your program has the DOM object, you can examine that object.

Here's a short example that shows the essentials of processing an XML document with the xml.dom.minidom module. This example prints the tags and collects each <informaltable> element that it finds.

We define a walkNode function which does a recursive, depth-first traversal of the elements in the document structure. In many applications, the structure of the XML document is well known, and functions which are tied to the structure of the document can be used. In this example, we're reading a DocBook XML file, which has a complex, highly-nested structure.

import xml.dom.minidom

tables= []
def walkNode( n, depth=0 ):
    # print this element, then visit each child element in turn
    print( depth*' ', n.tagName )
    if n.tagName == "informaltable":
        tables.append( n )
    for d in n.childNodes:
        if d.nodeType == xml.dom.Node.ELEMENT_NODE:
            walkNode( d, depth+1 )

dom1 = xml.dom.minidom.parse("../p5-projects.xml")
walkNode( dom1.documentElement )
print( tables )

The DOM Object Model. The heart of a DOM parser is the DOM class hierarchy. Your program will work with an xml.dom.Document object. We'll look at a few essential classes of the DOM. There are other classes in this model, described in section 13.6.2 of the Python Library Reference. We'll focus on the most commonly-used classes.

The XML Document Object Model is a standard definition. The standard applies to Java programs as well as Python programs. The xml.dom package provides definitions which meet this standard. The standard doesn't address how XML is parsed to create this structure. Consequently, the xml.dom package has no official parser. You could, for example, use a SAX parser to produce a DOM structure. Your handler would create objects from the classes defined in xml.dom.

The xml.dom.minidom package is an implementation of the DOM standard, which is slightly simplified. This implementation of the standard is extended to include a parser. The essential class definitions, however, come from xml.dom. We'll only look at methods used to get data from an XML document. We'll ignore the additional methods used by a parser to build a DOM object.

class Node

The Node class is the superclass for all of the various DOM classes. It defines a number of attributes and methods which are common to all of the various subclasses. This class should be thought of as abstract: it is not used directly; it exists to provide common features to all of the subclasses.

Here are the attributes which are common to all of the various kinds of Nodes

nodeType

This is an integer code that discriminates among the subclasses of Node. There are a number of helpful symbolic constants which are class variables in xml.dom.Node. These constants define the various types of Nodes: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, NOTATION_NODE.
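A minimal sketch of using nodeType to decide how to treat each child of an element follows. The SAMPLE document and the kinds list are made up for the example.

```python
import xml.dom
import xml.dom.minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = "<doc><!-- note --><para>Hello</para></doc>"

dom = xml.dom.minidom.parseString(SAMPLE)
kinds = []
for child in dom.documentElement.childNodes:
    # Compare against the symbolic constants rather than raw integers.
    if child.nodeType == xml.dom.Node.ELEMENT_NODE:
        kinds.append(("element", child.tagName))
    elif child.nodeType == xml.dom.Node.COMMENT_NODE:
        kinds.append(("comment", child.data))
    elif child.nodeType == xml.dom.Node.TEXT_NODE:
        kinds.append(("text", child.data))
print(kinds)
```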

attributes

This is a map-like collection of attributes. It is an instance of xml.dom.NamedNodeMap. It has methods and attributes including get, getNamedItem, getNamedItemNS, has_key, item, items, itemsNS, keys, keysNS, length, removeNamedItem, removeNamedItemNS, setNamedItem, setNamedItemNS, values. The item method and length attribute are defined by the standard and provided for Java compatibility.

localName

If there is a namespace, then this is the portion of the name after the colon. If there is no namespace, this is the entire tag name.

prefix

If there is a namespace, then this is the portion of the name before the colon. If there is no namespace, this is an empty string.

namespaceURI

If there is a namespace, this is the URI for that namespace. If there is no namespace, this is None.
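Here is a small sketch showing these three attributes for an element that uses a namespace prefix. It relies on xml.dom.minidom.parseString; the SAMPLE document and the example.com URI are invented for illustration.

```python
import xml.dom.minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = '<d:doc xmlns:d="http://example.com/d"><d:item/></d:doc>'

dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement
print("tagName     ", root.tagName)        # the full name, with the colon
print("prefix      ", root.prefix)         # the part before the colon
print("localName   ", root.localName)      # the part after the colon
print("namespaceURI", root.namespaceURI)   # the URI bound to the prefix
```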

parentNode

This is the parent of this Node. The Document Node will have None for this attribute, since it is the parent of all Nodes in the document. For all other Nodes, this is the context in which the Node appears.

previousSibling

Sibling Nodes share a common parent. This attribute of a Node is the Node which precedes it within a parent. If this is the first Node under a parent, the previousSibling will be None. Often, the preceding Node will be a Text node containing whitespace.

nextSibling

Sibling Nodes share a common parent. This attribute of a Node is the Node which follows it within a parent. If this is the last Node under a parent, the nextSibling will be None. Often, the following Node will be Text containing whitespace.
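Because whitespace Text nodes sit between elements in an indented document, code that wants "the next element" usually has to skip over them. The following sketch shows one way to do that; the next_element helper and the SAMPLE document are hypothetical.

```python
import xml.dom
import xml.dom.minidom

def next_element(node):
    """Return the next sibling that is an Element, or None if there is none."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != xml.dom.Node.ELEMENT_NODE:
        sibling = sibling.nextSibling   # skip whitespace Text, Comments, etc.
    return sibling

# Hypothetical sample document, used only for illustration.
SAMPLE = "<doc>\n  <a/>\n  <b/>\n</doc>"

dom = xml.dom.minidom.parseString(SAMPLE)
a = dom.getElementsByTagName("a")[0]
print(repr(a.nextSibling.data))   # the whitespace between <a/> and <b/>
print(next_element(a).tagName)
```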

childNodes

The list of child Nodes under this Node. Generally, this will be an xml.dom.NodeList instance, not a simple Python list. A NodeList behaves like a list, but has two extra features: the item method and the length attribute, which are defined by the standard and provided for Java compatibility.

firstChild

The first Node in the childNodes list; the same Node as childNodes[0] when the list is not empty. It will be None if the childNodes list is empty.

lastChild

The last Node in the childNodes list; the same Node as childNodes[-1] when the list is not empty. It will be None if the childNodes list is empty.
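A brief sketch of these three attributes, using xml.dom.minidom.parseString on a made-up SAMPLE document:

```python
import xml.dom.minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = "<doc><a/><b/><c/></doc>"

dom = xml.dom.minidom.parseString(SAMPLE)
root = dom.documentElement

# childNodes works like a Python list, and also offers item() and length.
print(len(root.childNodes), root.childNodes.length)
print(root.childNodes.item(1).tagName)

# firstChild and lastChild are the two ends of that list.
print(root.firstChild.tagName, root.lastChild.tagName)

# An element with no children has None for both.
print(root.firstChild.firstChild, root.firstChild.lastChild)
```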

Here are some attributes which are overridden in each subclass of Node. They have slightly different meanings for each node type.

nodeName

A string with the "name" for this Node. For an Element, this will be the same as the tagName attribute. In some cases, it will be None.

nodeValue

A string with the "value" for this Node. For a Text, this will be the same as the data attribute. In some cases, it will be None.
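To make the differences concrete, this sketch prints nodeName and nodeValue for an Element, an Attr and a Text taken from the same small document. The SAMPLE text is the para example used elsewhere in this section.

```python
import xml.dom.minidom

SAMPLE = '<para id="sample">Text</para>'

dom = xml.dom.minidom.parseString(SAMPLE)
para = dom.documentElement
attr = para.getAttributeNode("id")
text = para.firstChild

# Each kind of Node fills in nodeName and nodeValue differently.
for node in (dom, para, attr, text):
    print(type(node).__name__, repr(node.nodeName), repr(node.nodeValue))
```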

Here are some methods of a Node.

hasAttributes

This function returns True if there are attributes associated with this Node.

hasChildNodes

This function returns True if there are child Nodes associated with this Node.

class Document( Node )

This is the top-level document, the object returned by the parser. It is a subclass of Node, so it inherits all of those attributes and methods. The Document class adds some attributes and method functions to the Node definition.

documentElement

This attribute refers to the top-most Element in the XML document. A Document may contain DocumentType, ProcessingInstruction and Comment Nodes, also. This attribute saves you having to dig through the childNodes list for the top Element.

getElementsByTagName( tagName )

This function returns a NodeList with each Element in this Document that has the given tag name.

getElementsByTagNameNS( namespaceURI , tagName )

This function returns a NodeList with each Element in this Document that has the given namespace URI and local tag name.

class Element( Node )

This is a specific element within an XML document. An element is surrounded by XML tags. In <para id="sample">Text</para>, the tag is <para>, which provides the name for the Element. Most Elements will have children, some will have Attributes as well as children. The Element class adds some attributes and method functions to the Node definition.

tagName

The full name for the tag. If there is a namespace, this will be the complete name, including colons. This will also be in nodeName.

getElementsByTagName( tagName )

This function returns a NodeList with each Element in this Element that has the given tag name.

getElementsByTagNameNS( namespaceURI , tagName )

This function returns a NodeList with each Element in this Element that has the given namespace URI and local tag name.
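The difference between the Document and Element versions of getElementsByTagName is only the starting point of the search. A minimal sketch, using a hypothetical SAMPLE document:

```python
import xml.dom.minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<book>"
    "<chapter><title>One</title></chapter>"
    "<chapter><title>Two</title></chapter>"
    "</book>"
)

dom = xml.dom.minidom.parseString(SAMPLE)

# Searching from the Document finds every matching Element.
all_titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]
print(all_titles)

# Searching from one Element finds only the matches beneath it.
first_chapter = dom.getElementsByTagName("chapter")[0]
chapter_titles = [t.firstChild.data
                  for t in first_chapter.getElementsByTagName("title")]
print(chapter_titles)
```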

hasAttribute( name )

Returns True if this Element has an Attr with the given name.

hasAttributeNS( namespaceURI , localName )

Returns True if this Element has an Attr with the given name based on the namespace and localName.

getAttribute( name )

Returns the string value of the Attr with the given name. If the attribute doesn't exist, this will return a zero-length string.

getAttributeNS( namespaceURI , localName )

Returns the string value of the Attr with the given name. If the attribute doesn't exist, this will return a zero-length string.

getAttributeNode( name )

Returns the Attr with the given name. If the named attribute doesn't exist, this method returns None.

getAttributeNodeNS( namespaceURI , localName )

Returns the Attr with the given name. If the named attribute doesn't exist, this method returns None.
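The following sketch exercises the attribute methods on the para example, as implemented by xml.dom.minidom. The "role" attribute is a made-up name that does not exist in the document, used to show what a missing attribute looks like.

```python
import xml.dom.minidom

SAMPLE = '<para id="sample">Text</para>'

dom = xml.dom.minidom.parseString(SAMPLE)
para = dom.documentElement

print(para.hasAttribute("id"), para.hasAttribute("role"))

# getAttribute gives a string either way; a missing attribute is "".
print(repr(para.getAttribute("id")), repr(para.getAttribute("role")))

# getAttributeNode gives the Attr object, or None when it is missing.
id_node = para.getAttributeNode("id")
print(id_node.name, id_node.value, para.getAttributeNode("role"))
```

Since getAttribute cannot distinguish a missing attribute from one set to an empty string, use hasAttribute or getAttributeNode when that difference matters.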

class Attr( Node )

This is an attribute, within an Element. In <para id="sample">Text</para>, the tag is <para>; this tag has an attribute of id with a value of sample. Generally, the nodeType, nodeName and nodeValue attributes are all that are used. The Attr class adds some attributes to the Node definition.

name

The full name of the attribute, which may include colons. The Node class defines localName, prefix and namespaceURI which may be necessary for correctly processing this attribute.

value

The string value of the attribute. Also note that nodeValue will have a copy of the attribute's value.

class Text( Node ) and class CDATASection( Node )

This is the text within an element. In <para id="sample">Text</para>, the text is Text. Note that end of line characters and indentation also count as Text nodes. Further, the parser may break up a large piece of text into a number of smaller Text nodes. The Text class adds an attribute to the Node definition.

data

The text. Also note that nodeValue will have a copy of the text.
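Since the text of an element can be spread over several Text and CDATASection nodes, and over nested inline elements, a small recursive helper is a common way to gather it. This is one possible sketch; the text_of function and the SAMPLE document are hypothetical.

```python
import xml.dom
import xml.dom.minidom

def text_of(node):
    """Concatenate all text found beneath node, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (xml.dom.Node.TEXT_NODE,
                              xml.dom.Node.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == xml.dom.Node.ELEMENT_NODE:
            parts.append(text_of(child))   # descend into inline markup
    return "".join(parts)

# Hypothetical sample document, used only for illustration.
SAMPLE = "<para>Call <function>len</function> on <![CDATA[a < b]]> first.</para>"

dom = xml.dom.minidom.parseString(SAMPLE)
print(text_of(dom.documentElement))
```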

class Comment( Node )

This is the text within a comment. The <!-- and --> characters are not included. The Comment class adds an attribute to the Node definition.

data

The comment. Also note that nodeValue will have a copy of the comment.


   
  Published under the terms of the Open Publication License