File Formats: CSV, Tab, XML, Logs and Others

We looked at general features of the file system in Files. In this chapter we’ll look at Python techniques for handling files in a few of the innumerable formats that are in common use. Most file formats are relatively easy to handle with Python techniques we’ve already seen. Comma-Separated Values (CSV) files, XML files and packed binary files, however, are a little more sophisticated.

This is only the tip of the iceberg in the far larger problem called persistence. In addition to simple file system persistence, we also have the possibility of object persistence using an object database. In this case, the database processing lies between our program and the file system on which the database resides. This area also includes object-relational mapping, where our program relies on a mapper; the mapper uses the database, and the database manages the file system. We can’t explore the whole persistence problem in this chapter.

In this chapter we’ll present a conceptual overview of the various approaches to reading and writing files in Overview. We’ll look at reading and writing CSV files in Comma-Separated Values: The csv Module, and tab-delimited files in Tab Files: Nothing Special. We’ll look at reading property files in Property Files and Configuration (or .INI ) Files: The ConfigParser Module. We’ll look at the subtleties of processing legacy COBOL files in Fixed Format Files, A COBOL Legacy: The codecs Module. We’ll cover the basics of reading XML files in XML Files: The xml.etree and xml.sax Modules.

Most programs need a way to write sophisticated, easy-to-control log files that contain status and debugging information. For simple one-page programs, the print statement is fine. As soon as we have multiple modules, where we need more sophisticated debugging, we find a need for the logging module. Of course, any program that requires careful auditing will benefit from the logging module. We’ll look at creating standard logs in Log Files: The logging Module.

Overview

When we introduced the concept of a file, we mentioned that we could look at a file on two levels.

  • A file is a sequence of bytes. This is the OS’s view of files, as it is the lowest-common denominator.
  • A file is a sequence of data objects, represented as sequences of bytes.

A file format is the processing rules required to translate between usable Python objects and sequences of bytes. People have invented innumerable distinct file formats. We’ll look at some techniques which should cover most of the bases.

We’ll look at three broad families of files: text, binary and pickled objects. Each has some advantages and processing complexities.

  • Text files are designed so that a person can easily read and write them. We’ll look at several common text file formats, including CSV, XML, Tab-delimited, property-format, and fixed position. Since text files are intended for human consumption, they are difficult to update in place.
  • Binary files are designed to optimize processing speed or the overall size of the file. Most databases use very complex binary file formats for speed. A JPEG file, on the other hand, uses a binary format to minimize the size of the file. A binary-format file will typically place data at known offsets, making it possible to do direct access to any particular byte using the seek() method of a Python file object.
  • Pickled Objects are produced by Python’s pickle or shelve modules. There are several pickle protocols available, including text and binary alternatives. More importantly, a pickled file is not designed to be seen by people, nor have we spent any design effort optimizing performance or size. In a sense, a pickled object requires the least design effort.
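A quick sketch of the pickled-object approach shows how little format design is involved (shown here in current Python, where pickle produces bytes; the data values are hypothetical):

```python
import pickle

# Some application data; the values are purely illustrative.
flights = [("LAX", "JFK", 2475), ("ORD", "SFO", 1846)]

serialized = pickle.dumps(flights)   # object -> bytes; pickle chooses the format
restored = pickle.loads(serialized)  # bytes -> an equal object
```

The bytes in serialized could just as easily be written to a file with dump() and read back with load(); no punctuation rules, offsets, or quoting conventions had to be designed.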

Comma-Separated Values: The csv Module

Often, we have data that is in Comma-Separated Value (CSV) format. This format is used by many spreadsheets and is a widely used standard for data files.

In Reading a CSV File the Hard Way we parsed CSV files using simple string manipulations. The csv module does a far better job at parsing and creating CSV files than the programming we showed in those examples.

About CSV Files. CSV files are text files organized around data that has rows and columns. This format is used to exchange data between spread-sheet programs or databases. A CSV file uses a number of punctuation rules to encode the data.

  • Each row is delimited by a line-ending sequence of characters. This is usually the ASCII sequence \r\n. Since this may not be the default way to process text files on your platform, you have to open files using the “rb” and “wb” modes.
  • Within a row, columns are delimited by a ,. To handle the situation where a column’s data contains a ,, the column data may be quoted: surrounded by " characters. If the column contains a ", there are two common rules used. One CSV dialect uses an escape character, usually \". The other dialect doubles the quote, using "".

In the ideal case, a CSV file will have the same number of columns in each row, and the first row will be column titles. Almost as pleasant is a file without column titles, but with a known sequence of columns. In the more complex cases, the number of columns per row varies.
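A small round-trip sketch shows the quoting rules at work (written in current Python, where the csv module works on text streams; an in-memory StringIO stands in for a real file):

```python
import csv
import io  # an in-memory text buffer stands in for a real file

buffer = io.StringIO()
csv.writer(buffer).writerow(['GE', 'a "quoted" word', '37,000'])
encoded = buffer.getvalue()

# The embedded comma and quotes are protected; reading the text back
# recovers the original column values exactly.
row = next(csv.reader(io.StringIO(encoded)))
```

The writer surrounds the troublesome columns with " characters and doubles the embedded quotes; the reader undoes both transformations.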

The csv Module. The csv module provides you with readers or writers; these are objects which use an existing file object, created with the file() or open() function. A CSV reader will read a file, parsing the commas and quotes, delivering you the data elements of each row in a sequence or mapping. A CSV writer will create a file, adding the necessary commas and quotes to create a valid CSV file.

The following constructors within the csv module are used to create a reader, DictReader, writer or DictWriter.

csv.reader(csvfile) → reader

Creates a reader object which can parse the given file, returning a sequence of values for each line of the file. The csvfile can be any iterable object.

This can be used as follows.

rdr= csv.reader( open( "file.csv", "rb" ) )
for row in rdr:
    print row
csv.writer(csvfile) → writer

Creates a writer object which can format a sequence of values and write them to a line of the file. The csvfile can be any object which supports a write() method.

This can be used as follows.

target= open( "file.csv", "wb" )
wtr= csv.writer( target )
wtr.writerow( ["some","list","of","values"] )
target.close()

It’s very handy to use the with statement to assure that the file is properly closed.

with open( "file.csv", "wb" ) as target:
    wtr= csv.writer( target )
    wtr.writerow( ["some","list","of","values"] )
csv.DictReader(csvfile[, fieldnames]) → DictReader
Creates a DictReader object which can parse the given file, returning a dictionary of values for each line of the file. The dictionary keys are typically the first line of the file. You can, optionally, provide the field names if they are not the first line of the file. The csvfile can be any iterable object.
csv.DictWriter(csvfile[, fieldnames]) → DictWriter
Creates a DictWriter object which can format a dictionary of values and write them to a line of the file. You must provide a sequence of field names which is used to format each individual dictionary entry. The csvfile can be any object which supports a write() method.

Reader Functions. The following functions within a reader (or DictReader) object will read and parse the CSV file.

Writer Functions. The following functions with a writer (or DictWriter) object will format and write a CSV file.

Basic CSV Reading Example.

The basic CSV reader processing treats each line of the file as data. This is typical for files which lack column titles, or files which have such a complex format that special parsing and analysis is required. In some cases, a file has a simple, regular format with a single row of column titles, which can be processed by a special reader we’ll look at below.

We’ll revise the readquotes.py program from Reading a CSV File the Hard Way. This will properly handle all of the quoting rules, eliminating a number of irritating problems with the example in the previous chapter.

readquotes2.py

import csv
qFile= file( "quotes.csv", "rb" )
csvReader= csv.reader( qFile )
for q in csvReader:
    try:
        stock, price, date, time, change, opPrc, dHi, dLo, vol = q
        print stock, float(price), date, time, change, vol
    except ValueError:
        pass
qFile.close()
  1. We open our quotes file, quotes.csv, for reading, creating an object named qFile.
  2. We create a csv.reader object which will parse this file for us, transforming each line into a sequence of individual column values.
  3. We use a for statement to iterate through the sequence of lines in the file.
  4. We surround the processing with a try statement. A line with an invalid number for the price will raise a ValueError exception, which is caught in the except clause and quietly ignored.
  5. Each stock quote, q, is a sequence of column values. We use multiple assignment to assign each field to a relevant variable. We don’t need to strip whitespace, split the string, or handle quotes; the reader already did this.
  6. Since the price is a string, we use the float() function to convert this string to a proper numeric value for further processing.

Column Headers as Dictionary Keys In some cases, you have a simple, regular file with a single line of column titles. In this case, you can transform each line of the file into a dictionary. The key for each field is the column title. This can lead to programs which are clearer and more flexible. The flexibility comes from not assuming a specific order to the columns.

We’ll revise the readportfolio.py program from Reading “Records”. This will properly handle all of the quoting rules, eliminating a number of irritating problems with the example in the previous chapter. It will make use of the column titles in the file.

readportfolio2.py

import csv
quotes=open( "display.csv", "rb" )
csvReader= csv.DictReader( quotes )
invest= 0
current= 0
for data in csvReader:
    print data
    invest += float(data["Purchase Price"])*float(data["# Shares"])
    current += float(data["Price"])*float(data["# Shares"])
print invest, current, (current-invest)/invest
  1. We open our portfolio file, display.csv, for reading, creating a file object named quotes.

  2. We create a csv.DictReader object from our quotes file. This will read the first line of the file to get the column titles; each subsequent line will be parsed and transformed into a dictionary.

  3. We initialize two counters, invest and current to zero. These will accumulate our initial investment and the current value of this portfolio.

  4. We use a for statement to iterate through the lines in quotes file. Each line is parsed, and the column titles are used to create a dictionary, which is assigned to data.

  5. Each line of the file is parsed and transformed into a dictionary, data. We don’t need to strip whitespace, split the string, or handle quotes; the DictReader already did this.

  6. We perform some simple calculations on each dict. In this case, we convert the purchase price to a number, convert the number of shares to a number and multiply to determine how much we spent on this stock. We accumulate the sum of these products into invest.

    We also convert the current price to a number and multiply this by the number of shares to get the current value of this stock. We accumulate the sum of these products into current.

  7. When the loop has terminated, we can write out the two numbers, and compute the percent change.

Writing CSV Files The most general case for writing CSV is shown in the following example. Assume we’ve got a list of objects, named someList. Further, let’s assume that each object has three attributes: this, that and aKey.

import csv
myFile= open( "result", "wb" )
wtr= csv.writer( myFile )
for someData in someList:
    aRow= [ someData.this, someData.that, someData.aKey, ]
    wtr.writerow( aRow )
myFile.close()

In this case, we assemble the list of values that becomes a row in the CSV file.

In some cases we can provide two methods to allow our classes to participate in CSV writing. We can define a csvRow() method as well as a csvHeading() method. These methods will provide the necessary tuples of heading or data to be written to the CSV file.

For example, let’s look at the following class definition for a small database of sailboats. This class shows how the csvRow() and csvHeading() methods might look.

class Boat( object ):
    csvHeading= [ "name", "rig", "sails" ]
    def __init__( self, name, rig, sails ):
        self.name= name
        self.rig= rig
        self.sails= sails
    def __str__( self ):
        return "%s (%s, %r)" % ( self.name, self.rig, self.sails )
    def csvRow( self ):
        return [ self.name, self.rig, self.sails ]

Including these methods in our class definitions simplifies the loop that writes the objects to a CSV file. Instead of building each row as a list, we can do the following: wtr.writerow( someData.csvRow() ) .

Here’s an example that leverages each object’s internal dictionary (__dict__) to dump objects to a CSV file.

db= [
    Boat( "KaDiMa", "sloop", ( "main", "jib" ) ),
    Boat( "Glinda", "sloop", ( "main", "jib", "spinnaker" ) ),
    Boat( "Eillean Glas", "sloop", ( "main", "genoa" ) ),
    ]
test= file( "boats.csv", "wb" )
wtr= csv.DictWriter( test, Boat.csvHeading )
wtr.writerow( dict( zip( Boat.csvHeading, Boat.csvHeading ) ) )
for d in db:
    wtr.writerow( d.__dict__ )
test.close()

Tab Files: Nothing Special

Tab-delimited files are text files organized around data that has rows and columns. This format is used to exchange data between spread-sheet programs or databases. A tab-delimited file uses just two punctuation rules to encode the data.

  • Each row is delimited by an ordinary newline character. This is usually the standard \n. If you are exchanging files across platforms, you may need to open files for reading using the “rU” mode to get universal newline handling.
  • Within a row, columns are delimited by a single character, often \t. The column punctuation character that is chosen is one that will never occur in the data. It is usually (but not always) an unprintable character like \t.

In the ideal case, a tab-delimited file will have the same number of columns in each row, and the first row will be column titles. Almost as pleasant is a file without column titles, but with a known sequence of columns. In the more complex cases, the number of columns per row varies.

When we have a single, standard punctuation mark, we can simply use two operations in the string and list classes to process files. We use the split() method of a string to parse the rows. We use the join() method of a list to assemble the rows.

We don’t actually need a separate module to handle tab-delimited files.

Reading. The most general case for reading Tab-delimited data is shown in the following example.

myFile= open( "somefile", "rU" )
for aRow in myFile:
    print aRow.rstrip('\n').split('\t')
myFile.close()

Each row will be a list of column values.

Writing. The writing case is the inverse of the reading case. Essentially, we use "\t".join( someList ) to create the tab-delimited row. Here’s our sailboat example, done as tab-delimited data.

test= file( "boats.tab", "w" )
test.write( "\t".join( Boat.csvHeading ) )
test.write( "\n" )
for d in db:
    test.write( "\t".join( map( str, d.csvRow() ) ) )
    test.write( "\n" )
test.close()

Note that some elements of our data objects aren’t string values. In this case, the value for sails is a tuple, which needs to be converted to a proper string. The expression map(str, someList ) applies the str() function to each element of the original list, creating a new list which will have all string values. See Sequence Processing Functions: map(), filter() and reduce().
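The conversion can be seen in isolation in this small sketch (the boat values are hypothetical):

```python
# Mixed string and tuple data, as in the Boat example.
boat_values = ["KaDiMa", "sloop", ("main", "jib")]

# map(str, ...) converts every element to a string, so join() can work.
row = "\t".join(map(str, boat_values))
columns = row.split("\t")
```

The tuple becomes its string representation, which is adequate for a human reader of the tab file, though a program reading it back would need to parse that representation.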

Property Files and Configuration (or .INI ) Files: The ConfigParser Module

A property file, also known as a configuration (or .INI) file defines property or configuration values. It is usually just a collection of settings. The essential property-file format has a simple row-oriented format with only two values in each row. A configuration (or .INI) file organizes a simple list of properties into one or more named sections.

A property file uses a few punctuation rules to encode the data.

  • Lines beginning with # or ; are ignored. In some dialects the comment characters are # and !.
  • Each property setting is delimited by an ordinary newline character. This is usually the standard \n. If you are exchanging files across platforms, you may need to open files for reading using the “rU” mode to get universal newline handling.
  • Each property is a simple name and a value. The name is a string of characters that does not use a separator character of : or =. The value is everything after the punctuation mark, with leading and trailing spaces removed. In some dialects space is also a separator character.

Some property file dialects allow a value to continue on to the next line. In this case, a line that ends with \ (the two-character sequence \ \n) escapes the usual meaning of \n. Rather than being the end of a line, the \n is demoted to just another whitespace character.
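One way to honor the continuation rule is to join the physical lines into logical lines before any name/value parsing is done. This sketch (with hypothetical input) shows the idea:

```python
def join_continuations(lines):
    # Glue a line ending in a backslash onto the line that follows it,
    # producing one logical line per property definition.
    logical = []
    pending = ''
    for line in lines:
        line = line.rstrip('\n')
        if line.endswith('\\'):
            pending += line[:-1]      # drop the backslash, keep accumulating
        else:
            logical.append(pending + line)
            pending = ''
    if pending:                       # a dangling continuation on the last line
        logical.append(pending)
    return logical

lines = ["greeting=hello \\\n", "world\n", "count=3\n"]
result = join_continuations(lines)
```

After this pass, each element of result can be handed to the property-parsing steps shown below exactly as if continuations did not exist.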

A property file is an extension to the basic tab-delimited file. It has just two columns per line, and some space-stripping is done. However, it doesn’t have a consistent separator, so it is slightly more complex to parse.

The extra feature introduced in a configuration file is named sections.

  • A line beginning with [, ending with ], is the beginning of a section. The []‘s surround the section name. All of the lines from here to the next section header are collected together.

Reading a Simple Property File. Here’s an example of reading the simplest kind of property file. In this case, we’ll turn the entire file into a dictionary. Python doesn’t provide a module for doing this. The processing is a sequence of string manipulations to parse the file.

propFile= file( r"C:\Java\jdk1.5.0_06\jre\lib\logging.properties", "rU" )
propDict= dict()
for propLine in propFile:
    propDef= propLine.strip()
    if len(propDef) == 0:
        continue
    if propDef[0] in ( '!', '#' ):
        continue
    punctuation= [ propDef.find(c) for c in ':= ' ] + [ len(propDef) ]
    found= min( [ pos for pos in punctuation if pos != -1 ] )
    name= propDef[:found].rstrip()
    value= propDef[found:].lstrip(":= ").rstrip()
    propDict[name]= value
propFile.close()
print propDict
print propDict['handlers']

The input line is subject to a number of processing steps.

  1. First the leading and trailing whitespace is removed. If the line is empty, nothing more needs to be done.
  2. If the line begins with ! or # ( ; in some dialects) it is ignored.
  3. We find the location of all relevant punctuation marks. In some dialects, space is not permitted as a separator. Note that we append the length of the line so that a single word is a valid property name, with an implicit value of a zero-length string.
  4. By discarding punctuation positions of -1, we are only processing the positions of punctuation marks which actually occur in the string. The smallest of these positions is the left-most punctuation mark.
  5. The name is everything before the punctuation mark, with whitespace removed.
  6. The value is everything after the punctuation mark. Any additional separators are removed, and any trailing whitespace is also removed.

Reading a Config File. The ConfigParser module has a number of classes for processing configuration files. You initialize a ConfigParser object with default values. The object can then read one or more configuration files. You can then use methods to determine what sections were present and what options were defined in a given section.

import ConfigParser
cp= ConfigParser.RawConfigParser( )
cp.read( r"C:\Program Files\Mozilla Firefox\updater.ini" )
print cp.sections()
print cp.options('Strings')
print cp.get('Strings','info')
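Since that example depends on a Firefox file being present, here is a self-contained variant (in current Python the module is named configparser, and its read_string() method parses configuration text directly; the section and option names are hypothetical):

```python
import configparser  # current Python's spelling of the ConfigParser module

config_text = """
[Strings]
info = A sample message
title = Updater
"""

cp = configparser.ConfigParser()
cp.read_string(config_text)        # parse configuration text directly

sections = cp.sections()           # which sections were present
info = cp.get('Strings', 'info')   # fetch one option from one section
```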

Eschewing Obfuscation. While a property file is rather simple, it is possible to simplify property files further. The essential property definition syntax is so close to Python’s own syntax that some applications use a simple file of Python variable settings. In this case, the settings file would look like this.

settings.py

# Some Properties
TITLE = "The Title String"
INFO = """The information string.
Which uses Python's ordinary techniques
for long lines."""

This file can be introduced in your program with one statement: import settings . This statement will create module-level variables, settings.TITLE and settings.INFO.

Fixed Format Files, A COBOL Legacy: The codecs Module

Files that come from COBOL programs have three characteristic features:

  • The file layout is defined positionally. There are no delimiters or separators on which to base file parsing. The file may not even have \n characters at the end of each record.
  • They’re usually encoded in EBCDIC, not ASCII or Unicode.
  • They may include packed decimal fields; these are numeric values represented with two decimal digits (or a decimal digit and a sign) in each byte of the field.

The first problem requires figuring out the starting position and size of each field. In some cases, there are no gaps (or filler) between fields; in this case the sizes of each field are all that are required. Once we have the position and size, however, we can use a string slice operation to pick those characters out of a record. The code is simply aLine[start:start+size].

We can tackle the second problem using the codecs module to decode the EBCDIC characters. The result of codecs.getdecoder('cp037') is a function that you can use as an EBCDIC decoder.
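For example (shown with bytes literals, as in current Python):

```python
import codecs

# getdecoder() returns a function; cp037 is a common EBCDIC code page.
ebcdic_decode = codecs.getdecoder('cp037')

raw = b'\xc8\xc5\xd3\xd3\xd6'        # the word "HELLO" in EBCDIC
text, consumed = ebcdic_decode(raw)  # a decoder returns (text, bytes_consumed)
```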

The third problem requires that our program know the data type as well as the position and offset of each field. If we know the data type, then we can do EBCDIC conversion or packed decimal conversion as appropriate. This is a much more subtle algorithm, since we have two strategies for converting the data fields. See Strategy for some reasons why we’d do it this way.

In order to mirror COBOL’s largely decimal world-view, we will need to use the decimal module for all numbers and arithmetic.

We note that the presence of packed decimal data changes the file from text to binary. We’ll begin with techniques for handling a text file with a fixed layout. However, since this often slides over to binary file processing, we’ll move on to that topic, also.

Reading an All-Text File. If we ignore the EBCDIC and packed decimal problems, we can easily process a fixed-layout file. The way to do this is to define a handy structure that defines our record layout. We can use this structure to parse each record, transforming the record from a string into a dictionary that we can use for further processing.

In this example, we also use a generator function, yieldRecords(), to break the file into individual records. We separate this functionality out so that our processing loop is a simple for statement, as it is with other kinds of files. In principle, this generator function can also check the length of recBytes before it yields it. If the block of data isn’t the expected size, the file was damaged and an exception should be raised.

layout = [
    ( 'field1', 0, 12 ),
    ( 'field2', 12, 4 ),
    ( 'anotherField', 16, 20 ),
    ( 'lastField', 36, 8 ),
]
reclen= 44
def yieldRecords( aFile, recSize ):
    recBytes= aFile.read(recSize)
    while recBytes:
        yield recBytes
        recBytes= aFile.read(recSize)
cobolFile= file( 'my.cobol.file', 'rb' )
for recBytes in yieldRecords(cobolFile, reclen):
    record = dict()
    for name, start, size in layout:
        record[name]= recBytes[start:start+size]

Reading Mixed Data Types. If we have to tackle the complete EBCDIC and packed decimal problem, we have to use a slightly more sophisticated structure for our file layout definition. First, we need some data conversion functions, then we can use those functions as part of picking apart a record.

We may need several conversion functions, depending on the kind of data that’s present in our file. Minimally, we’ll need the following two functions.

display(bytes) → string

This function is used to get character data. In COBOL, this is called display data. It will be in EBCDIC if our files originated on a mainframe.

def display( bytes ):
    return bytes
packed(bytes) → string

This function is used to get packed decimal data. In COBOL, this is called COMP-3 data. In our example, we have not dealt with the insertion of the decimal point prior to the creation of a decimal.Decimal object.

import codecs
display = codecs.getdecoder('cp037')
def packed( bytes ):
    n= [ '' ]
    for b in bytes[:-1]:
        hi, lo = divmod( ord(b), 16 )
        n.append( str(hi) )
        n.append( str(lo) )
    digit, sign = divmod( ord(bytes[-1]), 16 )
    n.append( str(digit) )
    if sign in (0x0b, 0x0d ):
        n[0]= '-'
    else:
        n[0]= '+'
    return ''.join( n )
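Picking up the decimal-point problem the text defers: once packed() yields a signed digit string, the scale (the number of implied decimal places, which comes from the COBOL picture clause) finishes the conversion. The scale of 2 here is an assumed example, as for a PIC S9(5)V99 field:

```python
from decimal import Decimal

def to_decimal(digits, scale=2):
    # digits is a signed digit string such as "+0012345"; the implied
    # decimal point sits `scale` positions from the right.
    return Decimal(digits) / (10 ** scale)

value = to_decimal("+0012345")   # +123.45 with two implied decimal places
```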

Given these two functions, we can expand our handy record layout structure.

layout = [
    ( 'field1', 0, 12, display ),
    ( 'field2', 12, 4, packed ),
    ( 'anotherField', 16, 20, display ),
    ( 'lastField', 36, 8, packed ),
]
reclen= 44

This changes our record decoding to the following.

cobolFile= file( 'my.cobol.file', 'rb' )
for recBytes in yieldRecords(cobolFile, reclen):
    record = dict()
    for name, start, size, convert in layout:
        record[name]= convert( recBytes[start:start+size] )

This example underscores some of the key values of Python. Simple things can be kept simple. The layout structure, which describes the data, is both easy to read, and written in Python itself. The evolution of this example shows how adding a sophisticated feature can be done simply and cleanly.

At some point, our record layout will have to evolve from a simple tuple to a proper class definition. We’ll need to take this evolutionary step when we want to convert packed decimal numbers into values that we can use for further processing.
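That evolutionary step might look like the following sketch, where each field of the layout becomes an object that knows its own name, position, and conversion. The names and the sample record are hypothetical:

```python
class Field(object):
    # One field of a fixed layout: name, position, size, and an
    # optional conversion function applied to the sliced data.
    def __init__(self, name, start, size, convert=None):
        self.name = name
        self.start = start
        self.size = size
        self.convert = convert if convert is not None else (lambda data: data)

    def extract(self, record):
        return self.convert(record[self.start:self.start + self.size])

# A layout is now a list of Field objects instead of bare tuples.
layout = [Field('field1', 0, 5), Field('field2', 5, 3, int)]
record_text = 'HELLO123'
record = dict((f.name, f.extract(record_text)) for f in layout)
```

The record-decoding loop shrinks to a single dictionary construction, and a packed-decimal Field could carry its scale along with its conversion function.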

XML Files: The xml.etree and xml.sax Modules

XML files are text files, intended for human consumption, that mix markup with content. The markup uses a number of relatively simple rules. Additionally, there are structural requirements that assure that an XML file has a minimal level of validity. There are additional rules (either a Document Type Definition, DTD, or an XML Schema Definition, XSD) that provide additional structural rules.

There are several XML parsers available with Python.

xml.expat
We’ll ignore this parser, not for any particularly good reason.
xml.sax
We’ll look at the SAX parser because it provides us a way to break gigantic XML files into more manageable chunks.
xml.dom
This is a document object model (DOM) for XML.
xml.minidom
This is a stripped-down implementation of the XML document object model, along with a parser to build the document objects from XML.
xml.pulldom
This module uses SAX to create document objects from a portion of a larger XML document.
xml.etree
This is a more useful DOM-oriented parser that allows sophisticated XPATH-like searching through the resulting document objects.

xml.sax Parsing. The Simple API for XML (SAX) parser is described as an event parser. The parser recognizes different elements of an XML document and invokes methods in a handler which you provide. Your handler will be given pieces of the document, and can do appropriate processing with those pieces.

For most XML processing, your program will have the following outline; the parser will use your ContentHandler as it parses.

  1. Define a subclass of xml.sax.ContentHandler. The methods of this class will do your unique processing.
  2. Request the module to create an instance of an xml.sax.Parser.
  3. Create an instance of your handler class. Provide this to the parser you created.
  4. Set any features or options in the parser.
  5. Invoke the parser on your document (or incoming stream of data from a network socket).

Here’s a short example that shows the essentials of building a simple XML parser with the xml.sax module. This example defines a simple ContentHandler that prints the tags as well as counting the occurrences of the <informaltable> tag.

import xml.sax
class DumpDetails( xml.sax.ContentHandler ):
    def __init__( self ):
        self.depth= 0
        self.tableCount= 0
    def startElement( self, aName, someAttrs ):
        print self.depth*' ' + aName
        self.depth += 1
        if aName == 'informaltable':
            self.tableCount += 1
    def endElement( self, aName ):
        self.depth -= 1
    def characters( self, content ):
        pass # ignore the actual data

p= xml.sax.make_parser()
myHandler= DumpDetails()
p.setContentHandler( myHandler )
p.parse( "../p5-projects.xml" )
print myHandler.tableCount, "tables"

Since the parsing is event-driven, your handler must accumulate any context required to determine where the individual tags occur. In some content models (like XHTML and DocBook) there are two levels of markup: structural and semantic. The structural markup includes books, parts, chapters, sections, lists and the like. The semantic markup is sometimes called “inline” markup, and it includes tags to identify function names, class names, exception names, variable names, and the like. When processing this kind of document, your application must determine which tag is which.

A ContentHandler Subclass. The heart of a SAX parser is the subclass of ContentHandler that you define in your application. There are a number of methods which you may want to override. Minimally, you’ll override the startElement() and characters() methods. There are other methods of this class described in section 20.10.1 of the Python Library Reference.

ContentHandler.setDocumentLocator(locator)
The parser will call this method to provide an xml.sax.Locator object. This object has the XML document ID information, plus line and column information. The locator will be updated within the parser, so it should only be used within these handler methods.
ContentHandler.startDocument()
The parser will call this method at the start of the document. It can be used for initialization and resetting any context information.
ContentHandler.endDocument()
This method is paired with the startDocument() method; it is called once by the parser at the end of the document.
ContentHandler.startElement(name, attrs)

The parser calls this method with each tag that is found, in non-namespace mode. The name is the string with the tag name.

The attrs parameter is an xml.sax.Attributes object. This object is reused by the parser; your handler cannot save this object.

The xml.sax.Attributes object behaves somewhat like a mapping. It doesn’t support the [] operator for getting values, but does support get(), has_key(), items(), keys(), and values() methods.

ContentHandler.endElement(name)
The parser calls this method with each tag that is found, in non-namespace mode. The name is the string with the tag name.
ContentHandler.startElementNS(name, qname, attrs)

The parser calls this method with each tag that is found, in namespace mode. You set namespace mode by using the parser’s p.setFeature( xml.sax.handler.feature_namespaces, True ). The name is a tuple with the URI for the namespace and the tag name. The qname is the fully qualified text name.

The attrs parameter is described above under ContentHandler.startElement().

ContentHandler.endElementNS(name, qname)
The parser calls this method with each tag that is found, in namespace mode. The name is a tuple with the URI for the namespace and the tag name. The qname is the fully qualified text name.
ContentHandler.characters(content)
The parser uses this method to provide character data to the ContentHandler. The parser may provide character data in a single chunk, or it may provide the characters in several chunks.
ContentHandler.ignorableWhitespace(whitespace)
The parser will use this method to provide ignorable whitespace to the ContentHandler. This is whitespace between tags, usually line breaks and indentation. The parser may provide whitespace in a single chunk, or it may provide the characters in several chunks.
ContentHandler.processingInstruction(target, data)
The parser will provide all <?target data?> processing instructions to this method. Note that the initial declaration, <?xml version="1.0" encoding="UTF-8"?>, is not reported.
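As a concrete sketch of these handler methods, here is a small ContentHandler subclass that counts start tags and collects the text of every <title> element. The TitleHandler class name and the sample document are invented for illustration; the parser and handler interfaces are from xml.sax.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TitleHandler(ContentHandler):
    """Count start tags and collect the text of each <title> element."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}     # tag name -> number of occurrences
        self.titles = []
        self.buffer = None   # accumulates character chunks inside <title>
    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1
        if name == "title":
            self.buffer = []
    def characters(self, content):
        # The parser may deliver text in several chunks; accumulate them.
        if self.buffer is not None:
            self.buffer.append(content)
    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.buffer))
            self.buffer = None

doc = b"<book><title>First</title><title>Second</title></book>"
handler = TitleHandler()
xml.sax.parseString(doc, handler)
print(handler.counts)   # one 'book', two 'title' elements
print(handler.titles)   # ['First', 'Second']
```

Note that the handler never saves the attrs object; it only reads what it needs inside each callback.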

xml.etree Parsing. A DOM-style parser creates a document object model from your XML document: it transforms the text of an XML document into a tree of objects. Once your program has this tree, it can examine the objects directly.

Here’s a short example that shows the essentials of building a simple XML parser with the xml.etree module. This example locates all instances of the <informaltable> tag in the XML document and prints parts of this tag’s content.

#!/usr/bin/env python
from xml.etree import ElementTree

dom1 = ElementTree.parse("../PythonBook-2.5/p5-projects.xml")
for t in dom1.getiterator("informaltable"):
    print t.attrib
    for row in t.find('thead').getiterator('tr'):
        print "head row"
        for header_col in row.getiterator('th'):
            print header_col.text
    for row in t.find('tbody').getiterator('tr'):
        for body_col in row.getiterator('td'):
            print body_col.text

The DOM Object Model. The heart of a DOM parser is the DOM class hierarchy.

There is a widely used XML Document Object Model definition. This standard applies to Java as well as Python. The xml.dom package provides definitions which meet this standard.

The standard doesn’t address how XML is parsed to create this structure. Consequently, the xml.dom package has no official parser. You could, for example, use a SAX parser to produce a DOM structure. Your handler would create objects from the classes defined in xml.dom.

The xml.dom.minidom package is an implementation of the DOM standard, which is slightly simplified. This implementation of the standard is extended to include a parser. The essential class definitions, however, come from xml.dom.

The standard element hierarchy is rather complex. There’s an overview of the DOM model in The DOM Class Hierarchy.

The ElementTree Document Object Model. When using xml.etree your program will work with a number of xml.etree.ElementTree objects. We’ll look at a few essential classes of the DOM. There are other classes in this model, described in section 20.13 of the Python Library Reference. We’ll focus on the most commonly-used features of this class.

class ElementTree
parse(source) → ElementTree

Generally, ElementTree processing starts with parsing an XML document. The source can be either a filename or a file object containing XML text.

The result of parsing is an object that fits the ElementTree interface, and has a number of methods for examining the structure of the document.

getroot() → Element
Return the root Element of the document.
find(match) → Element
Return the first child element matching match. This is a handy shortcut for self.getroot().find(match). See Element.find().
findall(match) → list of Elements

Locate all child elements matching match. This is a handy shortcut for self.getroot().findall(match). See Element.findall().

Returns an iterable yielding all matching elements in document order.

findtext(match[, default=None]) → string
Locate the first child element matching match. This is a handy shortcut for self.getroot().findtext(match). See Element.findtext().
getiterator([tag=None])

Creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order. If tag is not None or ‘*’, only elements whose tag equals tag are returned from the iterator.

See Element.getiterator().
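The tree iterator is the usual way to sweep an entire document. Here is a minimal sketch; the document is invented, and the method is spelled iter() in later Python releases (getiterator() in older ones).

```python
from xml.etree import ElementTree

doc = """<chapter>
  <section><title>One</title></section>
  <section><title>Two</title></section>
</chapter>"""

tree = ElementTree.fromstring(doc)   # parse from a string instead of a file
for title in tree.iter("title"):     # getiterator("title") in older releases
    print(title.text)
```

This visits every <title> at any depth, in document order, regardless of the nesting in between.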

class Element

The ElementTree is a collection of individual nodes. Each node is an Element, a Comment, or a ProcessingInstruction. Generally, Comments and ProcessingInstructions behave like Elements.

tag
The tag for this element in the XML structure.
text
Generally, this is the text found between the element tags.
tail
This holds the text found after an element’s end tag and before the next tag. Often this is simply the whitespace between tags.
attrib
A mutable mapping containing the element’s attributes.
get(name[, default=None]) → string
Fetch the value of an attribute.
items() → list of 2-tuples
Return all attributes in a list as ( name, value ) tuples.
keys() → list of strings
Return a list of all attribute names.
find(match) → Element
Return the first child element matching match. The match may be a simple tag name or an XPath expression. Returns an Element instance or None.
findall(match) → list of Elements

Locate all child elements matching match. The match may be a simple tag name or an XPath expression.

Returns an iterable yielding all matching elements in document order.

findtext(match[, default=None]) → string
Locate the first child element matching match. The match may be a simple tag name or an XPath expression. Returns the text value of the first matching element. If the element is empty, the text will be a zero-length string. Returns default if no element was found.
getiterator([tag=None])
Creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order. If tag is not None or ‘*’, only elements whose tag equals tag are returned from the iterator.
getchildren()
Return all direct children, in document order.
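The attributes above can be seen with a small fragment. This sketch uses an invented <para> element; note how text stops at the first child tag and tail picks up after it.

```python
from xml.etree import ElementTree

elem = ElementTree.fromstring('<para id="p1">Hello <b>world</b> again</para>')

print(elem.tag)          # para
print(elem.get("id"))    # p1
print(elem.text)         # 'Hello ' -- text before the first child tag
child = elem.find("b")
print(child.text)        # world
print(child.tail)        # ' again' -- text between </b> and </para>
```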

When using Element.find(), Element.findall() and Element.findtext(), a simple XPath-like syntax can be used.

Match queries can have the form "tag/tag/tag" to specify a specific grandparent-parent-child nesting of tags. Additionally, "*" can be used as a wildcard.

For example, here’s a query that looks for a specific nesting of tags.

from xml.etree import ElementTree
dom1 = ElementTree.parse("../PythonBook-2.5/p5-projects.xml")
for t in dom1.findall("chapter/section/informaltable"):
    print t

Note that only a limited subset of full XPath syntax is supported.

Log Files: The logging Module

Most programs need a way to write sophisticated, easy-to-control log files that contain status and debugging information. Any program that requires careful auditing will benefit from using the logging module to create an easy-to-read permanent log. Also, when we have programs with multiple modules and need more sophisticated debugging, we'll find a need for the logging module.

There are several closely related concepts that define a log.

  1. Your program will have a hierarchical tree of Loggers. Each Logger does two things: it creates LogRecord objects with your messages about errors or debugging information, and it provides these LogRecords to Handlers.

    Generally, each major component will have its own logger. The various loggers can have separate filter levels so that debugging or warning messages can be selectively enabled or disabled.

  2. Your program will have a small number of Handlers, which are given LogRecords. A Handler can ignore the records, write them to a file or insert them into a database.

    It’s common to have a handler which creates a very detailed log in a persistent file, and a second handler that simply reports errors and exceptions to the system’s stderr file.

  3. Each Handler can make use of a Formatter to provide a nice, readable version of each LogRecord message.

  4. Also, you can build sophisticated Filters if you need to handle complex situations.

The default configuration gives you a single root Logger, named "", which uses a StreamHandler configured to write to the standard error file, stderr.

Advantages. While the logging module can appear complex, it gives us a number of distinct advantages.

  • Multiple Loggers. We can easily create a large number of separate loggers. This helps us to manage large, complex programs. Each component of the program can have its own, independent logger.

    However, we can configure the collection of loggers centrally, supporting sophisticated auditing and debugging that is independent of each individual component.

    Also, all the loggers can feed a single, common log file.

    Each logger can also have a severity level filter. This allows us to selectively enable debugging or disable warnings on a logger-by-logger basis.

  • Hierarchy of Loggers. Each Logger instance has a name, which is a .-separated string of names. For example, 'myapp.stock', 'myapp.portfolio'.

    This forms a natural hierarchy of Loggers. Each child inherits the configuration from its parent, which simplifies configuration.

    If, for example, we have a program which does stock portfolio analysis, we might have a component which does stock prices and another component which does overall portfolio calculations. Each component, then, could have a separate Logger named for the component. Both of these Loggers are children of the "" Logger; the configuration for the top-most Logger would apply to both children.

    Some components define their own Loggers. For example, SQLAlchemy has a set of Loggers with 'sqlalchemy' as the first part of their name. You can configure all of them by using that top-level name. For specific debugging, you might alter the configuration of just one Logger, for example, 'sqlalchemy.orm.sync'.

  • Multiple Handlers. Each Logger can feed a number of Handlers. This allows you to assure that a single important log message can go to multiple destinations. A common setup is to have two Handlers for log messages: a FileHandler which records everything, and a StreamHandler which writes only severe error messages to stderr.

    For some kinds of applications, you may also want to add the SysLogHandler (in conjunction with a Filter) to send some messages to the operating system-maintained system log as well as the application’s internal log.

    Another example is using the SMTPHandler to send selected log messages via email as well as to the application’s log and stderr.

  • Level Numbers and Filters. Each LogRecord includes a message level number, and a destination Logger name (as well as the text of the message and arguments with values to insert into the message). There are a number of predefined level numbers which are used for filtering. Additionally, a Filter object can be created to filter by destination Logger name, or any other criteria.

    The predefined levels are CRITICAL, ERROR, WARNING, INFO, and DEBUG. These are coded with numeric values from 50 to 10.

    Critical messages usually indicate a complete failure of the application; they are often the last message sent before it stops running. Error messages indicate problems which are not fatal, but preclude the creation of usable results. Warnings are questions or notes about the results being produced. Informational messages are the standard messages that describe successful processing, and debug messages provide additional details.

    By default, all Loggers will show only messages which have a level number greater than or equal to WARNING (30). When enabling debugging, we rarely want to debug an entire application. Instead, we usually enable debugging on specific modules. We do this by changing the level of a specific Logger.

    You can create additional level numbers or change the level numbers. Programmers familiar with Java, for example, might want to change the levels to SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, using level numbers from 70 through 10.
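The predefined levels and the per-logger filtering above can be sketched in a few lines; the 'myapp.*' logger names here are invented for the example.

```python
import logging

# The predefined level names are module constants with fixed numbers.
for name in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"):
    print("%-8s %2d" % (name, getattr(logging, name)))

# Enable debugging for one component only; its siblings still inherit
# the root logger's default level of WARNING.
logging.getLogger("myapp.stock").setLevel(logging.DEBUG)
print(logging.getLogger("myapp.stock").getEffectiveLevel())      # 10
print(logging.getLogger("myapp.portfolio").getEffectiveLevel())  # 30
```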

Module-Level Functions. The following module-level functions will get a Logger that can be used for logging. Additionally, there are functions to create the Handlers, Filters and Formatters that can be used to configure a Logger.

logging.getLogger(name) → Logger
Returns a Logger with the given name. The name is a .-separated string of names (e.g., "x.y.z"). If the Logger already exists, it is returned. If the Logger did not exist, it is created and returned.
logging.addLevelName(level, name)
Defines (or redefines) a level number, providing a name that will be displayed for the given level number. Generally, you will parallel these definitions with your own constants. For example, CONFIG=20; logging.addLevelName(CONFIG, "CONFIG")
logging.basicConfig(...)

Configures the logging system. By default this creates a StreamHandler directed to stderr, and a default Formatter. Also, by default, all Loggers show only WARNING or higher messages. There are a number of keyword parameters that can be given to basicConfig().

Parameters:
  • filename – This keyword provides the filename used to create a FileHandler instead of a StreamHandler. The log will be written to the given file.
  • filemode – If a filename is given, this is the mode used to open the file. By default, a file is opened with 'a', appending to the log file.
  • format – This is the format string for the Handler that is created. A Formatter object has a format() method which expects a dictionary of values; the format string uses "%(key)s" conversion specifications. See String Formatting with Dictionaries for more information. The dictionary provided to a Formatter is the LogRecord , which has a number of fields that can be interpolated into a log string.
  • datefmt – The date/time format to use for the asctime attribute of a LogRecord. This is a format string based on the time package time.strftime() function. See Dates and Times: the time and datetime Modules for more information on this format string.
  • level – This is the default message level for all loggers. The default is WARNING, 30. Messages with a lower level (i.e., INFO and DEBUG) are not shown.
  • stream – This is a stream that will be used to initialize a StreamHandler instead of a FileHandler. This is incompatible with filename. If both filename and stream are provided, stream is ignored.

Typically, you’ll use this in the following form: logging.basicConfig( level=logging.INFO ).
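The format mechanism can also be exercised directly, outside of basicConfig(). Here is a sketch that attaches a Formatter to a StreamHandler writing to an in-memory stream, so the formatted record can be inspected; the 'demo' logger name and the message are invented.

```python
import io
import logging

buf = io.StringIO()                      # stand-in for stderr or a log file
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("disk %s is %d%% full", "/tmp", 93)

print(buf.getvalue())    # WARNING:demo:disk /tmp is 93% full
```

Each "%(key)s" in the format string pulls the corresponding attribute out of the LogRecord.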

logging.config.fileConfig(fname)
Configures the logging system from a configuration file. Note that this function lives in the logging.config module, not in logging itself. The file defines the loggers, handlers and formatters that will be built initially. Once the loggers are built by the configuration, the logging.getLogger() function will return one of these pre-built loggers.
logging.shutdown()
Finishes logging by flushing all buffers and closing all handlers, which generally closes any internally created files and streams. An application must do this last to assure that all log messages are properly recorded in the log.

Logger Method Functions. The following functions are used to create a LogRecord in a Logger; a LogRecord is then processed by the Handlers associated with the Logger.

Many of these functions have essentially the same signature. They accept the text for a message as the first argument. This message can have string conversion specifications, which are filled in from the various arguments. In effect, the logger does message % args for you.

You can provide a number of argument values, or you can provide a single argument which is a dictionary. This gives us two principal ways of producing log messages.

  • log.info( "message %s, %d", "some string", 2 )
  • log.info( "message %(part1)s, %(anotherpart)d", { "part1": "some string", "anotherpart": 2 } )

These functions also have an optional argument, exc_info, which can have either of two values. You can provide the keyword argument exc_info=sys.exc_info(). As an alternative, you can provide exc_info=True, in which case the logging module will call sys.exc_info() for you.
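Here is a sketch of exc_info=True, using an in-memory stream so the result is visible; the handler setup and the 'excdemo' logger name are invented, and only the exc_info behavior is the point.

```python
import io
import logging

buf = io.StringIO()
log = logging.getLogger("excdemo")
log.addHandler(logging.StreamHandler(buf))

try:
    int("not a number")
except ValueError:
    # exc_info=True makes the logging module call sys.exc_info()
    # and append the formatted traceback to the message.
    log.error("conversion failed", exc_info=True)

print(buf.getvalue())    # "conversion failed" followed by the traceback
```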

Logger.debug(message, args, ...)
Creates a LogRecord with level DEBUG, then processes this LogRecord on this Logger. The message is the message text; the args are the arguments which are provided to the formatting operator, %.
Logger.info(message, args, ...)
Creates a LogRecord with level INFO on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary.
Logger.warning(message, args, ...)
Creates a LogRecord with level WARNING on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary.
Logger.error(message, args, ...)
Creates a LogRecord with level ERROR on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary.
Logger.critical(message, args, ...)
Creates a LogRecord with level CRITICAL on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary.
Logger.log(level, message, args, ...)
Creates a LogRecord with the given level on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary. The exc_info keyword argument can provide exception information.
Logger.exception(message, args, ...)

Creates a LogRecord with level ERROR on this logger. The positional arguments fill in the message; a single positional argument can be a dictionary.

Exception info is added to the logging message, as if the keyword parameter exc_info=True. This method should only be called from an exception handler.

Logger.isEnabledFor(level) → bool

Returns True if this Logger will handle messages of this level or higher. This can be handy to prevent creating really complex debugging output that would only get ignored by the logger. This is rarely needed, and is used in the following structure:

if log.isEnabledFor(logging.DEBUG):
    log.debug( "some complex message" )

The following method functions are used to configure a Logger. Generally, you’ll configure Loggers using the module level basicConfig() and fileConfig() functions. However, in some specialized circumstances (like unit testing), you may want finer control without the overhead of a configuration file.

Logger.propagate

When set to True, messages are also passed to the handlers of this Logger's ancestors. This assures consistency for audit purposes.

When False, the ancestors' handlers will not see the message. A False value might be used for keeping debugging messages separate from other messages. By default this is a True value.
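A short sketch of the propagate flag, using in-memory streams as stand-ins for the real handlers; the logger names are invented.

```python
import io
import logging

root_buf, child_buf = io.StringIO(), io.StringIO()
logging.getLogger().addHandler(logging.StreamHandler(root_buf))

child = logging.getLogger("quiet.child")
child.addHandler(logging.StreamHandler(child_buf))
child.propagate = False    # keep these records away from the root handlers

child.warning("local only")
print(repr(child_buf.getvalue()))   # 'local only\n'
print(repr(root_buf.getvalue()))    # '' -- nothing reached the root
```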

Logger.setLevel(level)
Sets the level for this Logger; messages less severe are ignored. Messages of this severity or higher are handled. The special value of logging.NOTSET indicates that this Logger inherits the setting from the parent. The root logger has a default value of logging.WARNING.
Logger.getEffectiveLevel() → level
Gets the effective level for this Logger. If this Logger has a setting of logging.NOTSET (the default for all Loggers), then it inherits the level from its parent.
Logger.addFilter(filter)
Adds the given Filter object to this Logger.
Logger.removeFilter(filter)
Removes the given Filter object from this Logger.
Logger.addHandler(handler)
Adds the given Handler object to this Logger.
Logger.removeHandler(handler)
Removes the given Handler object from this Logger.

There are also some functions which would be used if you were creating your own subclass of Logger for more specialized logging purposes. These methods include log.filter(), log.handle() and log.findCaller().

Using a Logger. Generally, there are a number of ways of using a Logger. In a module that is part of a larger application, we will get an instance of a Logger, and trust that it was configured correctly by the overall application. In the top-level application we may both configure and use a Logger.

This example shows a simple module file which uses a Logger.

logmodule.py

import logging, logging.config, sys

logger= logging.getLogger(__name__)

def someFunc( a, b ):
    logger.debug( "someFunc( %d, %d )", a, b )
    try:
        return 2*int(a) + int(b)
    except ValueError:
        logger.warning( "ValueError in someFunc( %r, %r )", a, b, exc_info=True )

def mainFunc( *args ):
    logger.info( "Starting mainFunc" )
    z= someFunc( args[0], args[1] )
    print z
    logger.info( "Ending mainFunc" )

if __name__ == "__main__":
    logging.config.fileConfig( "logmodule_log.ini" )
    mainFunc( *sys.argv[1:] )
    logging.shutdown()
  1. We import the logging module and the sys module.

  2. We ask the logging module to create a Logger with the given name. We use the Python-assigned __name__ name. This works well for all imported library modules and packages.

    We do this through a factory function to assure that the logger is configured correctly. The logging module actually keeps a pool of Loggers, and will assure that there is only one instance of each named Logger.

  3. This function has a debugging message and a warning message. This is typical of most function definitions. Ordinarily, the debug message will not show up in the log; we can only see it if we provide a configuration which sets the log level to DEBUG for the root logger or the logmodule Logger.

  4. This function has a pair of informational messages. This is typical of “main” functions which drive an overall application program. Applications which have several logical steps might have informational messages for each step. Since informational messages are lower level than warnings, these don’t show up by default; however, the main program that uses this module will often set the overall level to logging.INFO to enable informational messages.

File Format Exercises

  1. Create An Office Suite Result. Back in Iteration Exercises we used the for statement to produce tabular displays of data in a number of exercises. This included “How Much Effort to Produce Software?”, “Wind Chill Table”, “Celsius to Fahrenheit Conversion Tables” and “Dive Planning Table”. Update one of these programs to produce a CSV file. If you have a desktop office suite, be sure to load the CSV file into a spreadsheet program to be sure it looks correct.

  2. Proper File Parsing. Back in File Module Exercises we built a quick and dirty CSV parser. Fix these programs to use the CSV module properly.

  3. Configuration Processing. In Stock Valuation, we looked at a program which processed blocks of stock. One of the specific programs was an analysis report which showed the value of the portfolio on a given date at a given price. Make this program more flexible by having it read a configuration file with the current date and stock prices.

  4. Office Suite Extraction. Most office suite software can save files in XML format as well as their own proprietary format. The XML is complex, but you can examine it in pieces using Python programs. It helps to work with highly structured data, like an XML version of a spreadsheet. For example, your spreadsheet may use tags like <Table>, <Row> and <Cell> to organize the content of the spreadsheet.

    First, write a simple program to show the top-level elements of the document. It often helps to show the text within those elements so that you can correlate the XML structure with the original document contents.

    Once you can display the top-level elements, you can focus on the elements that have meaningful data. For example, if you are parsing spreadsheet XML, you can assemble the values of all of the <Cell> elements in a <Row> into a proper row of data, perhaps using a simple Python list.

The DOM Class Hierarchy

This is some supplemental information on the xml.dom and xml.dom.minidom object models for XML documents.

class Node

The Node class is the superclass for all of the various DOM classes. It defines a number of attributes and methods which are common to all of the various subclasses. This class should be thought of as abstract: it is not used directly; it exists to provide common features to all of the subclasses.

Here are the attributes which are common to all of the various kinds of Node objects.

nodeType

This is an integer code that discriminates among the subclasses of Node. There are a number of helpful symbolic constants which are class variables in xml.dom.Node. These constants define the various types of Nodes.

ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, NOTATION_NODE.

attributes
This is a map-like collection of attributes. It is an instance of xml.dom.NamedNodeMap. It has method functions including get(), getNamedItem(), getNamedItemNS(), has_key(), item(), items(), itemsNS(), keys(), keysNS(), length(), removeNamedItem(), removeNamedItemNS(), setNamedItem(), setNamedItemNS(), and values(). The item() and length() methods are defined by the standard and provided for Java compatibility.
localName
If there is a namespace, then this is the portion of the name after the colon. If there is no namespace, this is the entire tag name.
prefix
If there is a namespace, then this is the portion of the name before the colon. If there is no namespace, this is an empty string.
namespaceURI
If there is a namespace, this is the URI for that namespace. If there is no namespace, this is None .
parentNode
This is the parent of this Node. The Document Node will have None for this attribute, since it is the parent of all Nodes in the document. For all other Nodes, this is the context in which the Node appears.
previousSibling
Sibling Nodes share a common parent. This attribute of a Node is the Node which precedes it within a parent. If this is the first Node under a parent, the previousSibling will be None. Often, the preceding Node will be a Text node containing whitespace.
nextSibling
Sibling Nodes share a common parent. This attribute of a Node is the Node which follows it within a parent. If this is the last Node under a parent, the nextSibling will be None. Often, the following Node will be a Text node containing whitespace.
childNodes
The list of child Nodes under this Node. Generally, this will be an xml.dom.NodeList instance, not a simple Python list. A NodeList behaves like a list, but has two extra methods: item() and length(), which are defined by the standard and provided for Java compatibility.
firstChild
The first Node in the childNodes list, equivalent to childNodes[0]. It will be None if the childNodes list is empty.
lastChild
The last Node in the childNodes list, equivalent to childNodes[-1]. It will be None if the childNodes list is empty.

Here are some attributes which are overridden in each subclass of Node. They have slightly different meanings for each node type.

nodeName
A string with the “name” for this Node. For an Element, this will be the same as the tagName attribute. In some cases, it will be None.
nodeValue
A string with the “value” for this Node. For a Text node, this will be the same as the data attribute. In some cases, it will be None.

Here are some methods of a Node.

hasAttributes() → bool
This function returns True if there are attributes associated with this Node.
hasChildNodes() → bool
This function returns True if there are child Nodes associated with this Node.
class Document(Node)

This is the top-level document, the object returned by the parser. It is a subclass of Node, so it inherits all of those attributes and methods. The Document class adds some attributes and method functions to the Node definition.

documentElement
This attribute refers to the top-most Element in the XML document. A Document may contain DocumentType, ProcessingInstruction and Comment Nodes, also. This attribute saves you having to dig through the childNodes list for the top Element.
getElementsByTagName(tagName) → list
This function returns a NodeList with each Element in this Document that has the given tag name.
getElementsByTagNameNS(namespaceURI, tagName) → list
This function returns a NodeList with each Element in this Document that has the given namespace URI and local tag name.
class Element(Node)

This is a specific element within an XML document. An element is surrounded by XML tags. In <para id="sample">Text</para>, the tag is <para>, which provides the name for the Element. Most Elements will have children, some will have Attributes as well as children. The Element class adds some attributes and method functions to the Node definition.

tagName
The full name for the tag. If there is a namespace, this will be the complete name, including colons. This will also be in nodeName.
getElementsByTagName(tagName) → list
This function returns a NodeList with each Element in this Element that has the given tag name.
getElementsByTagNameNS(namespaceURI, tagName) → list
This function returns a NodeList with each Element in this Element that has the given namespace URI and local tag name.
hasAttribute(name) → bool
Returns True if this Element has an Attr with the given name.
hasAttributeNS(namespaceURI, localName) → bool
Returns True if this Element has an Attr with the given name based on the namespace and localName.
getAttribute(name) → string
Returns the string value of the Attr with the given name. If the attribute doesn’t exist, this will return a zero-length string.
getAttributeNS(namespaceURI, localName) → string
Returns the string value of the Attr with the given name. If the attribute doesn’t exist, this will return a zero-length string.
getAttributeNode(name) → Node
Returns the Attr with the given name. If the named attribute doesn’t exist, this method returns None.
getAttributeNodeNS(namespaceURI, localName) → Node
Returns the Attr with the given name. If the named attribute doesn’t exist, this method returns None.
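Pulling these Element methods together, here is a short minidom sketch; the <chapter>/<para> document is invented for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<chapter><para id="p1">First</para><para id="p2">Second</para></chapter>')

print(doc.documentElement.tagName)            # chapter
for para in doc.getElementsByTagName("para"):
    # each para is an Element; its text is a child Text node
    print("%s %s" % (para.getAttribute("id"), para.firstChild.data))
```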
class Attr(Node)

This is an attribute within an Element. In <para id="sample">Text</para>, the tag is <para>; this tag has an attribute id with a value of sample. Generally, the nodeType, nodeName and nodeValue attributes are all that are used. The Attr class adds some attributes to the Node definition.

name
The full name of the attribute, which may include colons. The Node class defines localName, prefix and namespaceURI which may be necessary for correctly processing this attribute.
value
The string value of the attribute. Also note that nodeValue will have a copy of the attribute’s value.
class Text(Node)
class CDATASection(Node)

This is the text within an element. In <para id="sample">Text</para>, the text is Text. Note that end-of-line characters and indentation also count as Text nodes. Further, the parser may break up a large piece of text into a number of smaller Text nodes. The Text class adds an attribute to the Node definition.

data
The text. Also note that nodeValue will have a copy of the text.
class Comment(Node)

This is the text within a comment. The <!-- and --> characters are not included. The Comment class adds an attribute to the Node definition.

data
The comment. Also note that nodeValue will have a copy of the comment.
