File Handling Modules

There are a number of operations closely related to file processing. Deleting and renaming files are examples of operations that change the directory information that the operating system maintains to describe a file. Python provides numerous modules for these operating system operations.

We can’t begin to cover all of the various ways in which Python supports file handling. However, we can identify the essential modules that may help you avoid reinventing the wheel. Further, these modules can provide you a view of the Pythonic way of working with data from files.

The following modules have features that are essential for supporting file processing. We’ll cover selected features of each module that are directly relevant to file processing. We’ll present these in the order you’d find them in the Python library documentation.

Chapter 11 – File and Directory Access. Chapter 11 of the Library reference covers many modules which are essential for reliable use of files and directories. We’ll look closely at the following modules.

os.path

This module has functions for numerous common pathname manipulations. Use these functions to split and join full directory path names. This is operating-system neutral, with a correct implementation for all operating systems.

One of the most obvious differences among operating systems is the way that files are named. In particular, the path separator can be either the POSIX standard /, or the Windows \. Rather than make each program aware of the operating system rules for path construction, Python provides the os.path module to make all of the common filename manipulations completely consistent.

fileinput
This module has functions which will iterate over lines from multiple input streams. This allows you to write a single, simple loop that processes lines from any number of input files.
tempfile
This module has a class and functions to generate temporary files and temporary file names. This is the most secure way to generate a temporary file.
glob
This module provides shell style pathname pattern expansion. Unix shells translate name patterns like *.py into a list of files. This is called globbing. The glob module implements this within Python, which allows this feature to work even in Windows where it isn’t supported by the OS itself.
fnmatch
This module provides UNIX shell style filename pattern matching. This implements the glob-style rules using *, ? and []. * matches any number of characters, ? matches any single character, [chars] encloses a list of allowed characters, and [!chars] encloses a list of disallowed characters.
shutil
This module has useful, high-level file operations, including copy, rename and remove. These are the kinds of things that the shell handles with simple commands like cp or rm. This module makes these features available to a Python program or script.

Chapter 12 – Data Persistence. There are several issues related to making objects persistent. We’ll look at these modules in detail in File Formats: CSV, Tab, XML, Logs and Others.

pickle shelve
The pickle and shelve modules are used to create persistent objects: objects that persist beyond the one-time execution of a Python program. The pickle module produces a serial text representation of any object, however complex; it can also reconstitute an object from its text representation. The shelve module uses a dbm database to store and retrieve objects. The shelve module is not a complete object-oriented database, as it lacks any transaction management capabilities.
sqlite3
This module provides access to the SQLite relational database. This database provides a significant subset of SQL language features, allowing us to build a relational database that’s compatible with products like MySQL or Postgres.

Chapter 13 – Data Compression and Archiving. Data Compression is covered in Chapter 13 of the Library reference. We’ll look closely at the following modules.

tarfile zipfile
These modules help you read and write archive files: files which are an archive of a complex directory structure. This includes GNU/Linux tape archive (.tar) files, compressed GZip tar files (.tgz files or .tar.gz files) sometimes called tarballs, and ZIP files.
zlib gzip bz2
These modules are all variations on a common theme of reading and writing files which are compressed to remove redundant bytes of data. The zlib and bz2 modules have a more sophisticated interface, allowing you to use compression selectively within a more complex application. The gzip module has a different (and simpler) interface that applies only to complete files.

Chapter 14 – File Formats. These are modules for reading and writing files in a few of the amazing variety of file formats that are in common use. We’ll focus on just a few.

csv
The csv module helps you parse and create Comma-Separated Value (CSV) data files. This helps you exchange data with many desktop tools that produce or consume CSV files. We’ll look at this in Comma-Separated Values: The csv Module.

Chapter 16 - Generic Operating System Services. The following modules contain basic features that are common to all operating systems.

os
Miscellaneous OS interfaces. This includes parameters of the current process, additional file object creation, manipulations of file descriptors, managing directories and files, managing subprocesses, and additional details about the current operating system.
time
The time module provides basic functions for time and date processing. Additionally datetime handles details of the calendar more gracefully than time does. We’ll cover both modules in detail in Dates and Times: the time and datetime Modules.
logging
Often, you want a simple, standardized log for errors as well as debugging information. We’ll look at logging in detail in Log Files: The logging Module.

Chapter 28 - Python Runtime Services. The modules described in Chapter 28 of the Library reference include some that are used for handling various kinds of files. We’ll look closely at just one.

sys
This module has several system-specific parameters and functions, including definitions of the three standard files that are available to every program.

The os.path Module

The os.path module contains more useful functions for managing path and directory names. A serious mistake is to use ordinary string functions with literal strings for the path separators. A Windows program using \ as the separator won’t work anywhere else. A less serious mistake is to use os.sep instead of the routines in the os.path module.

The os.path module contains the following functions for completely portable path and filename manipulation.

os.path.basename(path) → name
Return the base filename, the second half of the result created by os.path.split( path ).
os.path.dirname(path) → name
Return the directory name, the first half of the result created by os.path.split( path ).
os.path.exists(path) → boolean
Return True if the pathname refers to an existing file or directory.
os.path.getatime(path) → time
Return the last access time of a file, reported by os.stat(). See the time module for functions to process the time value.
os.path.getmtime(path) → time
Return the last modification time of a file, reported by os.stat(). See the time module for functions to process the time value.
os.path.getsize(path) → long
Return the size of a file, in bytes, reported by os.stat().
os.path.isdir(path) → boolean
Return True if the pathname refers to an existing directory.
os.path.isfile(path) → boolean
Return True if the pathname refers to an existing regular file.
os.path.join(string, ...) → path
Join path components using the appropriate path separator.
os.path.split(path) → tuple
Split a path into two parts: the directory and the basename (the filename, without path separators, in that directory). The result (s, t) is such that os.path.join( s, t ) yields the original path.
os.path.splitdrive(path) → tuple
Split a pathname into a drive specification and the rest of the path. Useful on DOS/Windows/NT.
os.path.splitext(path) → tuple
Split a path into root and extension. The extension is everything starting at the last dot in the last component of the pathname; the root is everything before that. The result (r, e) is such that r+e yields the original path.

The following example is typical of the manipulations done with os.path.

import sys, os.path
def process( oldName, newName ):
    pass  # Some processing goes here...

for oldFile in sys.argv[1:]:
    dir, fileext= os.path.split(oldFile)
    file, ext= os.path.splitext( fileext )
    if ext.upper() == '.RST':
        newFile= os.path.join( dir, file ) + '.HTML'
        print oldFile, '->', newFile
        process( oldFile, newFile )
  1. This program imports the sys and os.path modules.
  2. The process() function does something interesting and useful to the input file. It is the real heart of the program.
  3. The for statement sets the variable oldFile to each string (after the first) in the sequence sys.argv.
  4. Each file name is split into the path name and the base name. The base name is further split to separate the file name from the extension. The os.path module does this correctly for all operating systems, saving us from having to write platform-specific code. For example, splitext() correctly handles the situation where a Linux file has multiple ‘.’s in the file name.
  5. The extension is tested to be ‘.RST’. A new file name is created from the path, base name and a new extension (‘.HTML’). The old and new file names are printed and some processing, defined in the process(), uses the oldFile and newFile names.
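
The same split-and-rejoin steps can be packaged as a small reusable function. The following is only a sketch: the function name swap_extension() and the sample file name are made up for illustration, and print() is called as a function so that the fragment also runs under Python 3.

```python
import os.path

def swap_extension(path, new_ext):
    """Return path with its extension replaced by new_ext (dot included)."""
    directory, base = os.path.split(path)
    root, _old_ext = os.path.splitext(base)
    return os.path.join(directory, root + new_ext)

# Build the sample path with join() so the separator is right on any platform.
source = os.path.join("docs", "chapter.01.rst")
target = swap_extension(source, ".html")

print(os.path.basename(target))   # chapter.01.html
print(os.path.dirname(target))    # docs
```

Only the text after the last dot is treated as the extension, so the "01" in the middle of the name is left alone.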

The os Module

The os module contains an interface to many operating system-specific functions to manipulate processes, files, file descriptors, directories and other “low level” features of the OS. Programs that import and use os stand a better chance of being portable between different platforms. Portable programs must depend only on functions that are supported for all platforms (e.g., unlink() and listdir()), and perform all pathname manipulation with os.path.

The os module exports the following variables that characterize your operating system.

os.name
A name for the operating system, for example 'posix', 'nt', 'dos', 'os2', 'mac', or 'ce'. Note that Mac OS X has an os.name of 'posix'; but sys.platform is 'darwin'. Windows XP has an os.name of 'nt'.
os.curdir
The string which represents the current directory ('.' , generally).
os.pardir
The string which represents the parent directory ('..' , generally).
os.sep
os.altsep

The (or a most common) pathname separator ('/' or ':' or '\') and the alternate pathname separator (None or '/'). Most of the Python library routines will translate '/' to the correct value for the operating system (typically, '\' on Windows.)

It is best to always use os.path rather than these low-level constants.

os.pathsep
The component separator used in the OS environment variable $PATH (':' is the standard, ';' is used for Windows).
os.linesep
The line separator in text files ('\n' is the standard; '\r\n' is used for Windows). This is already handled by the readlines() function.
os.defpath
The default search path for executables. The standard is ':/bin:/usr/bin' or '.;C:\bin' for Windows.
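
The difference between os.pathsep and os.sep is easy to confuse. The sketch below splits the executable search path with os.pathsep and a single path name with os.sep. It assumes nothing beyond the os module itself; the sample path "usr/local/bin" is an arbitrary illustration.

```python
import os

# Fall back to os.defpath when PATH is not set in the environment.
search_path = os.environ.get("PATH", os.defpath)

# os.pathsep separates the entries of the search path...
directories = [d for d in search_path.split(os.pathsep) if d]

# ...while os.sep separates the components inside one path name.
sample = os.path.join("usr", "local", "bin")
components = sample.split(os.sep)

print(os.name, repr(os.sep), repr(os.pathsep))
print(components)   # ['usr', 'local', 'bin']
```

In real programs, os.path.split() and os.path.join() remain the better tools for taking a path name apart and putting it back together.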

The os module has a large number of functions. Many of these are not directly related to file manipulation. However, a few are commonly used to create and remove files and directories. Beyond these basic manipulations, the shutil module supports a variety of file copy operations.

os.chdir(path)
Change the current working directory to the given path. This is the directory which the OS uses to transform a relative file name into an absolute file name.
os.getcwd() → path
Returns the path to the current working directory. This is the directory which the OS uses to transform a relative file name into an absolute file name.
os.listdir(path) → list
Returns a list of all entries in the given directory.
os.mkdir(path[, mode])
Create the given directory. In GNU/Linux, the mode can be given to specify the permissions; usually this is an octal number. If not provided, the default of 0777 is used, after being updated by the OS umask value.
os.rename(source, destination)
Rename the source filename to the destination filename. There are a number of errors that can occur if the source file doesn’t exist or the destination file already exists or if the two paths are on different devices. Each OS handles the situations slightly differently.
os.remove(file)
Remove (also known as delete or unlink) the file. If you attempt to remove a directory, this will raise OSError. If the file is in use, the standard behavior is to remove the file when it is finally closed; Windows, however, will raise an exception.
os.rmdir(path)
Remove (also known as delete or unlink) the directory. If you attempt to remove an ordinary file, this will raise OSError.

Here’s a short example showing some of the functions in the os module.

>>> import os
>>> os.chdir("/Users/slott")
>>> os.getcwd()
'/Users/slott'
>>> os.listdir(os.getcwd())
['.bash_history', '.bash_profile',
'.bash_profile.pysave', '.DS_Store',
'.filezilla', '.fonts.cache-1', '.fop', '.idlerc', '.isql.history',
'.lesshst', '.login_readpw',
'.monitor', '.mozilla', '.sh_history', '.sqlite_history',
'.ssh', '.subversion', '.texlive2008', '.Trash', '.viminfo', '.wxPyDemo',
'.xxe', 'Applications', 'argo.user.properties', 'Burn Folder.fpbf',
'Desktop', 'Documents', 'Downloads', 'Library', 'Movies',
'Music', 'Pictures', 'Public', 'Sites']
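
The session above only looks at a directory. As a complement, here is a minimal sketch of the create, rename and remove functions working together. To stay safe it runs inside a scratch directory made by tempfile.mkdtemp(); the names work, draft.txt and final.txt are invented for the example.

```python
import os
import tempfile

scratch = tempfile.mkdtemp()            # private scratch area
work = os.path.join(scratch, "work")
os.mkdir(work)                          # create a directory

old_name = os.path.join(work, "draft.txt")
new_name = os.path.join(work, "final.txt")
with open(old_name, "w") as f:
    f.write("some data\n")

os.rename(old_name, new_name)           # rename within the same directory
entries = os.listdir(work)              # ['final.txt']

os.remove(new_name)                     # remove the file first...
os.rmdir(work)                          # ...then the now-empty directories
os.rmdir(scratch)
```

Note the order of the cleanup: os.rmdir() only removes empty directories, so the file has to go first.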

The fileinput Module

The fileinput module interacts with sys.argv. The fileinput.input() function opens files based on all the values of sys.argv[1:]. It carefully skips sys.argv[0], which is the name of the Python script file. For each file, it reads all of the lines as text, allowing a program to read and process multiple files, like many standard Unix utilities.

The typical use case is:

import fileinput
for line in fileinput.input():
    process(line)

This iterates over the lines of all files listed in sys.argv[1:], with a default of sys.stdin if the list is empty. If a filename is - it is also replaced by sys.stdin at that position in the list of files. To specify an alternative list of filenames, pass it as the argument to input(). A single file name is also allowed in addition to a list of file names.

While processing input, several functions are available in the fileinput module:

fileinput.input() → iterator
Iterator over all lines from the cumulative collection of files.
fileinput.filename() → string
The filename of the line that has just been read.
fileinput.lineno() → int
The cumulative line number of the line that has just been read.
fileinput.filelineno() → int
The line number in the current file.
fileinput.isfirstline() → boolean
True if the line just read is the first line of its file.
fileinput.isstdin() → boolean
True if the line was read from sys.stdin.
fileinput.nextfile()
Close the current file so that the next iteration will read the first line from the next file (if any); lines not read from the file will not count towards the cumulative line count; the filename is not changed until after the first line of the next file has been read.
fileinput.close()
Closes the iterator.

All files are opened in text mode. If an I/O error occurs during opening or reading a file, the IOError exception is raised.
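
To see how the cumulative and per-file line numbers differ, the following sketch writes two small files into a scratch directory and reads them back with fileinput.input(). The file names a.txt and b.txt and their contents are assumptions made for the demonstration.

```python
import fileinput
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
names = []
for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
    path = os.path.join(scratch, name)
    with open(path, "w") as f:
        f.write(text)
    names.append(path)

report = []
for line in fileinput.input(names):
    report.append((
        os.path.basename(fileinput.filename()),  # file the line came from
        fileinput.filelineno(),                  # line number within that file
        fileinput.lineno(),                      # cumulative line number
        line.rstrip("\n"),
    ))
fileinput.close()
shutil.rmtree(scratch)

for row in report:
    print(row)
```

The third line read is line 1 of b.txt but line 3 overall, which is the distinction between filelineno() and lineno().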

Example. We can use fileinput to write a Python version of the common Unix utility, grep. The grep utility searches a list of files for a given pattern.

For non-Unix users, the grep utility looks for the given regular expression in any number of files. The name grep is an acronym of Global Regular Expression Print. This is similar to the Windows command find.

greppy.py

#!/usr/bin/env python
import sys
import re
import fileinput
pattern= re.compile( sys.argv[1] )
for line in fileinput.input(sys.argv[2:]):
    if pattern.match( line ):
        print fileinput.filename(), fileinput.filelineno(), line
  1. The sys module provides access to the command-line parameters. The re module provides the pattern matching. The fileinput module makes searching an arbitrary list of files simple. For more information on re, see Complex Strings: the re Module.

  2. The first command line argument ( sys.argv[0] ) is the name of the script, which this program ignores.

    The second command-line argument is the pattern that defines the target of the search.

    The remaining command-line arguments are given to fileinput.input() so that all the named files will be examined.

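  3. The pattern regular expression is matched against each individual input line. Note that match() succeeds only when the pattern matches starting at the beginning of the line; the search() method would find the pattern anywhere in the line, as grep does.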

    If match() returns None, the line did not match. If match() returns an object, the program prints the current file name, the current line number of the file and the actual input line that matched.

After we do a chmod +x greppy.py, we can use this program as follows. Note that we have to provide quotes to prevent the shell from doing globbing on our pattern string.

$ greppy.py 'import.*random' *.py
demorandom.py 2 import random
dice.py 1 import random
functions.py 2 import random

The glob and fnmatch Modules

The glob module gives Windows programmers a necessary Unix shell capability. The glob module includes the following function.

glob.glob(wildcard) → list
Return a list of filenames that match the given wild-card pattern. The fnmatch module is used for the wild-card pattern matching.

A common use for glob is something like the following under Windows.

import glob, sys
for arg in sys.argv[1:]:
    for f in glob.glob(arg):
        process( f )

This makes Windows programs process command line arguments somewhat like Unix programs. Each argument is passed to glob.glob() to expand any patterns into a list of matching files. If the argument is not a wild-card pattern, glob simply returns a list containing this one file name.

The fnmatch module has the algorithm for actually matching a wild-card pattern against specific file names. This module implements the Unix shell wild-card rules. These are not the same as the more sophisticated regular expression rules. The module contains the following function:

fnmatch.fnmatch(filename, pattern) → boolean
Return True if the filename matches the pattern.

The patterns use * to match any number of characters, ? to match any single character. [letters] matches any of these letters, and [!letters] matches any letter that is not in the given set of letters.

>>> import fnmatch
>>> fnmatch.fnmatch('greppy.py','*.py')
True
>>> fnmatch.fnmatch('README','*.py')
False
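
The following sketch exercises each of the wild-card rules against a handful of empty files in a scratch directory. The file names are invented for the example. It also uses fnmatch.filter(), a companion function in the same module that applies a pattern to a whole list of names.

```python
import fnmatch
import glob
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
for name in ("greppy.py", "dice.py", "README", "notes.txt"):
    open(os.path.join(scratch, name), "w").close()

listing = os.listdir(scratch)

# glob.glob() returns full paths, so reduce them to base names for display.
py_files = sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(scratch, "*.py")))
d_or_n = sorted(fnmatch.filter(listing, "[dn]*"))      # starts with d or n
not_g = sorted(fnmatch.filter(listing, "[!g]*.py"))    # .py, not starting with g
four = sorted(fnmatch.filter(listing, "????.py"))      # exactly four characters

shutil.rmtree(scratch)
print(py_files, d_or_n, not_g, four)
```

The results are sorted because neither glob.glob() nor os.listdir() promises any particular order.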

The tempfile Module

One common problem is to open a unique temporary file to hold intermediate results. The tempfile module provides us a handy way to create temporary files that we can write to and read from.

The tempfile module creates a temporary file in the most secure and reliable manner. The tempfile module includes an internal function, mkstemp(), which does the hard work of creating a unique temporary file.

tempfile.TemporaryFile(mode='w+b', bufsize=-1, suffix='', prefix='tmp', dir=None) → file

This function creates a file which is automatically deleted when it is closed. All of the parameters are optional. By default, the mode is 'w+b': write with read in binary mode.

The bufsize parameter has the same meaning as it does for the built-in open() function.

The keyword parameters suffix, prefix and dir provide some structure to the name assigned to the file. The suffix should include the dot, for example suffix='.tmp'.

tempfile.NamedTemporaryFile(mode='w+b', bufsize=-1, suffix='', prefix='tmp', dir=None, delete=True) → file

This function is similar to TemporaryFile() ; it creates a file which can be automatically deleted when it is closed. The temporary file, however, is guaranteed to be visible on the file system while the file is open.

The parameters are the same as a tempfile.TemporaryFile().

If the delete parameter is False, the file is not automatically deleted.

tempfile.mkstemp(suffix='', prefix='tmp', dir=None, text=False) → (fd, name)
This function does the essential job of creating a temporary file. It returns the internal file descriptor as well as the name of the file. The file is not automatically deleted. If necessary, the file created by this function can be explicitly deleted with os.remove().
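import tempfile, os
fd, tempName = tempfile.mkstemp( '.d1' )
# Wrap the descriptor that mkstemp() already opened, rather than opening the file a second time.
temp= os.fdopen( fd, 'w+' )
# Some processing goes here...
temp.close()
os.remove( tempName )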

This fragment will create a unique temporary file name with an extension of .d1. Since the name is guaranteed to be unique, this can be used without fear of damaging or overwriting any other file.
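
The two higher-level functions are usually more convenient than mkstemp(). Here is a minimal sketch of both, assuming only the parameters described above; the file contents are arbitrary.

```python
import os
import tempfile

# TemporaryFile: anonymous scratch space, deleted as soon as it is closed.
with tempfile.TemporaryFile(mode="w+") as scratch:
    scratch.write("intermediate results\n")
    scratch.seek(0)                 # rewind before reading back
    text = scratch.read()

# NamedTemporaryFile with delete=False: the file survives close(),
# so its name can be handed to another function or program.
named = tempfile.NamedTemporaryFile(suffix=".tmp", delete=False)
named.write(b"data")                # default mode is 'w+b', so write bytes
named.close()
still_there = os.path.exists(named.name)
os.remove(named.name)               # our responsibility when delete=False
```

With delete=False, the program is responsible for removing the file, as the final line shows.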

The shutil Module

The shutil module helps you automate copying files and directories. This saves the steps of opening, reading, writing and closing files when there is no actual processing, simply moving files.

shutil.copy(source, destination)
Copy data and mode bits; basically the Unix command cp src dst. If destination is a directory, a file with the same base name as source is created. If destination is a full file name, this is the destination file.
shutil.copyfile(source, destination)
Copy data from source to destination. Both names must be files.
shutil.copytree(source, destination)
Recursively copy the entire directory tree rooted at source to destination. destination must not already exist. If errors occur during the copy, a shutil.Error exception is raised with a list of the reasons.
shutil.rmtree(path)
Recursively delete a directory tree rooted at path.

Note that removing a file is done with os.remove() (or os.unlink()).
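
A short sketch of these four functions working together follows. Everything happens inside a scratch directory created with tempfile.mkdtemp(); the names src, data.txt, copy.txt and backup are made up for the example.

```python
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
src_dir = os.path.join(scratch, "src")
os.mkdir(src_dir)
original = os.path.join(src_dir, "data.txt")
with open(original, "w") as f:
    f.write("payload\n")

# Destination is a directory: the copy keeps the base name data.txt.
shutil.copy(original, scratch)
# Destination is a file name: the copy gets exactly that name.
shutil.copyfile(original, os.path.join(scratch, "copy.txt"))
# Copy the whole tree; the destination must not exist yet.
backup = os.path.join(scratch, "backup")
shutil.copytree(src_dir, backup)

top_level = sorted(os.listdir(scratch))
backup_contents = os.listdir(backup)

# Remove the scratch directory and everything under it.
shutil.rmtree(scratch)
print(top_level)   # ['backup', 'copy.txt', 'data.txt', 'src']
```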

This module allows us to build Python applications that are like shell scripts. There are a lot of advantages to writing Python programs rather than shell scripts to automate mundane tasks.

First, Python programs are easier to read than shell scripts. This is because the language did not evolve in a way that emphasized terseness; the shell script languages use a minimum of punctuation, which makes them hard to read.

Second, Python programs have a more sophisticated programming model, with class definitions, and numerous sophisticated data structures. The shell works with simple argument lists; it has to resort to running the test or expr programs to process strings or numbers.

Finally, Python programs have direct access to more of the operating system’s features than the shell. Generally, many of the basic GNU/Linux API calls are provided via innumerable small programs. Rather than having the shell run a small program to make an API call, Python can simply make the API call.

The File Archive Modules: tarfile and zipfile

An archive file contains a complex, hierarchical file directory in a single sequential file. The archive file includes the original directory information as well as the contents of all of the files in those directories. There are a number of archive file formats; Python directly supports two: tar and zip archives.

The tar (Tape Archive) format is widely used in the GNU/Linux world to distribute files. It is a POSIX standard, making it usable on a wide variety of operating systems. A tar file can also be compressed, often with the GZip utility, leading to .tgz or .tar.gz files which are compressed archives.

The Zip file format was invented by Phil Katz at PKWare as a way to archive a complex, hierarchical file directory into a compact sequential file. The Zip format is widely used but is not a POSIX standard. Zip file processing includes a choice of compression algorithms; the exact algorithm used is encoded in the header of the file, not in the name of file.

Creating a TarFile or a ZipFile. Since an archive file is still – essentially – a single file, it is opened with a variation on the open() function. Since an archive file contains directory and file contents, it has a number of methods above and beyond what a simple file has.

tarfile.open(name=None, mode='r', fileobj=None, bufsize=10240) → TarFile

This module-level function opens the given tar file for processing. The name is a file name string; it is optional because the fileobj can be used instead. The fileobj is a conventional file object, which can be used instead of the name; it can be a standard file like sys.stdin. The bufsize is like that of the built-in open() function.

The mode is similar to the built-in open() function; it has numerous additional characters to specify the compression algorithms, if any.

zipfile.ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=False) → ZipFile

This class constructor opens the given zip file for processing. The file is a file name string. The mode is similar to the built-in open() function. The compression is the compression code. It can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED. A compression of ZIP_STORED uses no compression; a value of ZIP_DEFLATED uses the Zlib compression algorithms.

The allowZip64 option is used when creating new, empty Zip Files. If this is set to True, then this will create files with the ZIP64 extensions. If this is left at False, an exception is raised any time a ZIP64 extension would be required.

The open function can be used to read or write the archive file. It can be used to process a simple disk file, using the filename. Or, more importantly, it can be used to process a non-disk file: this includes tape devices and network sockets. In the non-disk case, a file object is given to tarfile.open().

For tar files, the mode information is rather complex because we can do more than simply read, write and append. The mode string addresses three issues: the kind of opening (reading, writing, appending), the kind of access (block or stream) and the kind of compression.

For zip files, however, the mode is simply the kind of opening that is done.

Opening - Both zip and tar files. A zip or tar file can be opened in any of three modes.

r
Open the file for reading only.
w
Open the file for writing only.
a
Open an existing file for appending.

Access - tar files only. A tar file can have either of two fundamentally different kinds of access. If a tar file is a disk file, which supports seek and tell operations, then we access the tar file in block mode. If the tar file is a stream, network connection or a pipeline, which does not support seek or tell operations, then we must access the archive in stream mode.

:
Block mode. The tar file is a disk file, and seek and tell operations are supported. This is the assumed default, if neither : nor | is specified.
|
Stream mode. The tar file is a stream, socket or pipeline, and cannot respond to seek or tell operations. Note that you cannot append to a stream, so the 'a|' combination is illegal.

This access distinction isn’t meaningful for zip files.

Compression - tar files only. A tar file may be compressed with GZip or BZip2 algorithms, or it may be uncompressed. Generally, you only need to select compression when writing. It doesn’t make sense to attempt to select compression when appending to an existing file, or when reading a file.

(nothing)
The tar file will not be compressed.
gz
The tar file will be compressed with GZip.
bz2
The tar file will be compressed with BZip2.

This compression distinction isn’t meaningful for zip files. Zip file compression is specified in the zipfile.ZipFile constructor.

Tar File Examples. The most common block modes for tar files are r, a, w:, w:gz, w:bz2. Note that read and append modes cannot meaningfully provide compression information, since it’s obvious from the file if it was compressed, and which algorithm was used.

For stream modes, however, the compression information must be provided. The modes include all six combinations: r|, r|gz, r|bz2, w|, w|gz, w|bz2.
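
The mode strings are easier to follow in action. The sketch below writes a small gzip-compressed archive with w:gz, reads it back in block mode with a plain r, and then reads it again in stream mode with r|gz. The file name README and the archive name demo.tar.gz are assumptions, and the archive is built in a scratch directory.

```python
import os
import shutil
import tarfile
import tempfile

scratch = tempfile.mkdtemp()
member = os.path.join(scratch, "README")
with open(member, "w") as f:
    f.write("hello\n")
archive_name = os.path.join(scratch, "demo.tar.gz")

# Writing: the compression has to be named explicitly.
out = tarfile.open(archive_name, "w:gz")
out.add(member, arcname="demo/README")
out.close()

# Block-mode reading: plain "r" works out the compression from the file.
inp = tarfile.open(archive_name, "r")
names = inp.getnames()
inp.close()

# Stream-mode reading: no seeking back, so the compression is spelled out.
stream = tarfile.open(archive_name, "r|gz")
stream_names = [info.name for info in stream]
stream.close()

shutil.rmtree(scratch)
print(names, stream_names)
```

Here the stream-mode archive is still an ordinary disk file; the same mode string is what you would use when the data arrives through a pipe or a socket.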

Directory Information. Each individual file in a tar archive is described with a TarInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive.

TarFile.getmember(name) → TarInfo
Reads through the archive index looking for the given member name. Returns a TarInfo object for the named member, or raises a KeyError exception.
TarFile.getmembers() → list of TarInfo
Returns a list of TarInfo objects for all of the members in the archive.
TarFile.next() → TarInfo
Returns a TarInfo object for the next member of the archive.
TarFile.getnames() → list of strings
Returns a list of member names.

Each individual file in a zip archive is described with a ZipInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive.

ZipFile.getinfo(name) → ZipInfo
Locates information about the given member name. Returns a ZipInfo object for the named member, or raises a KeyError exception.
ZipFile.infolist() → list of ZipInfo
Returns a list of ZipInfo objects for all of the members in the archive.
ZipFile.namelist() → list of strings
Returns a list of member names.

Extracting Files From an Archive. If a tar archive is opened with r, then you can read the archive and extract files from it. The following methods will extract member files from an archive.

TarFile.extract(member[, path])
The member can be either a string member name or a TarInfo for a member. This will extract the file’s contents and reconstruct the original file. If path is given, this is the new location for the file.
TarFile.extractfile(member) → file
The member can be either a string member name or a TarInfo for a member. This will open a simple file for access to this member’s contents. The member access file has only read-oriented methods, limited to read(), readline(), readlines(), seek(), tell().

If a zip archive is opened with r, then you can read the archive and extract the contents of a file from it.

ZipFile.read(member) → string
The member is a string member name. This will extract the member’s contents, decompress them if necessary, and return the bytes that constitute the member.

Creating or Extending an Archive. If a tar archive is opened with w or a, then you can add files to it. The following methods will add member files to an archive.

TarFile.add(name[, arcname, recursive=True])
Adds the file with the given name to the current archive file. If arcname is provided, this is the name the file will have in the archive; this allows you to build an archive which doesn’t reflect the source structure. Generally, directories are expanded; using recursive=False prevents expanding directories.
TarFile.addfile(tarinfo, fileobj)
Creates an entry in the archive. The description comes from the tarinfo, an instance of TarInfo, created with the gettarinfo() function. The fileobj is an open file, from which the content is read. Note that the TarInfo.size field can override the actual size of the file. For a given filename, fn, this might look like the following: tf.addfile( tf.gettarinfo(fn), open(fn,"rb") ).
TarFile.close()
Closes the archive. For archives being written or appended, this adds the block of zeroes that defines the end of the file.
TarFile.gettarinfo(name[, arcname, fileobj]) → TarInfo
Creates a TarInfo object for a file based either on name, or the fileobj. If a name is given, this is a local filename. The arcname is the name that will be used in the archive, allowing you to modify local filesystem names. If the fileobj is given, this file is interrogated to gather required information.
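
As a variation on addfile(), the member does not have to exist on disk at all. The sketch below builds a TarInfo by hand instead of calling gettarinfo(), writes the archive into an in-memory buffer through the fileobj parameter, and then reads the member back with getmember() and extractfile(). The member name and the payload are invented for the example.

```python
import io
import tarfile

payload = b"generated in memory\n"

# Write the archive into a buffer instead of a named disk file.
buffer = io.BytesIO()
out = tarfile.open(fileobj=buffer, mode="w")
info = tarfile.TarInfo(name="notes/generated.txt")
info.size = len(payload)            # addfile() reads exactly this many bytes
out.addfile(info, io.BytesIO(payload))
out.close()                         # writes the end-of-archive blocks

# Rewind and read the archive back from the same buffer.
buffer.seek(0)
inp = tarfile.open(fileobj=buffer, mode="r")
member = inp.getmember("notes/generated.txt")
recovered = inp.extractfile(member).read()
inp.close()
```

Setting the size by hand matters here: a freshly created TarInfo has a size of zero, so without it the member would be stored empty.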

If a zip archive is opened with w or a, then you can add files to it.

ZipFile.write(filename[, arcname, compress_type])
The filename is a string file name. This will read the file, compress it, and write it to the archive. If the arcname is given, this will be the name in the archive; otherwise it will use the original filename. The compress_type parameter overrides the default compression specified when the ZipFile was created.
ZipFile.writestr(arcname, bytes)
The arcname is a string file name or a ZipInfo object that will be used to create a new member in the archive. This will write the given bytes to the archive. The compression used is specified when the ZipFile is created.
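
Here is a minimal round trip through a zip archive using writestr() on the way in and namelist(), getinfo() and read() on the way out. The archive name demo.zip and the member name notes/hello.txt are assumptions, and the archive is created in a scratch directory.

```python
import os
import shutil
import tempfile
import zipfile

scratch = tempfile.mkdtemp()
zip_name = os.path.join(scratch, "demo.zip")

# Create the archive and add one member directly from a byte string.
archive = zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED)
archive.writestr("notes/hello.txt", b"hello, archive\n")
archive.close()

# Reopen it for reading and inspect the member.
archive = zipfile.ZipFile(zip_name, "r")
names = archive.namelist()
info = archive.getinfo("notes/hello.txt")
size = info.file_size               # uncompressed size in bytes
content = archive.read("notes/hello.txt")
archive.close()

shutil.rmtree(scratch)
print(names, size)   # ['notes/hello.txt'] 15
```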

A tarfile Example. Here’s an example of a program to examine a tarfile, looking for documentation like .html files or README files. It will provide a list of .html files, and actually show the contents of the README files.

readtar.py

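#!/usr/bin/env python
"""Scan a tarfile looking for *.html and a README."""
import tarfile
import fnmatch

# Mode "r" detects gzip compression automatically.
archive= tarfile.open( "SQLAlchemy-0.3.5.tar.gz", "r" )
for mem in archive.getmembers():
    if fnmatch.fnmatch( mem.name, "*.html" ):
        # HTML documentation is only listed, not displayed.
        print( mem.name )
    elif fnmatch.fnmatch( mem.name.upper(), "*README*" ):
        print( mem.name )
        # extractfile() returns None for members that are not regular files.
        docFile= archive.extractfile( mem )
        if docFile is not None:
            print( docFile.read() )
archive.close()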

A zipfile Example. Here’s an example of a program to create a zipfile based on the .xml files in a particular directory.

writezip.py

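import zipfile, os, fnmatch

# ZIP_DEFLATED compresses each member; the default is to store uncompressed.
bookDistro= zipfile.ZipFile( 'book.zip', 'w', zipfile.ZIP_DEFLATED )
for nm in os.listdir('..'):
    if fnmatch.fnmatch(nm,'*.xml'):
        # listdir() returns bare names, so rebuild the path to the file.
        full= os.path.join( '..', nm )
        bookDistro.write( full )
bookDistro.close()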

The sys Module

The sys module provides access to some objects used or maintained by the interpreter and to functions that interact strongly with the interpreter.

The sys module also provides the three standard files used by Python.

sys.stdin: Standard input file object; used by raw_input() and input(). Also available via sys.stdin.read() and related methods of the file object.
sys.stdout: Standard output file object; used by the print statement. Also available via sys.stdout.write() and related methods of the file object.
sys.stderr: Standard error object; used for error messages, typically unhandled exceptions. Available via sys.stderr.write() and related methods of the file object.

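A program can assign another file object to one of these global variables. Doing so redirects the I/O that Python code performs through that name, including the print statement for sys.stdout. Output written by child processes, or directly to the operating system's file descriptors, is not affected.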
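
A small sketch of this kind of redirection follows. The report() function is hypothetical, standing in for any code that writes to standard output. The sketch is written for Python 3, where io.StringIO accepts ordinary strings; under Python 2 the StringIO module would play the same role.

```python
import io
import sys

def report():
    """Hypothetical function that writes to whatever sys.stdout currently is."""
    sys.stdout.write("3 files processed\n")

capture = io.StringIO()
saved_stdout = sys.stdout   # keep the original so it can be restored
sys.stdout = capture
try:
    report()
finally:
    # Always restore, even if report() raises an exception.
    sys.stdout = saved_stdout

captured_text = capture.getvalue()
```

Saving the original object and restoring it in a finally clause keeps a failure inside the redirected code from leaving the rest of the program writing to the wrong place.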

Command-Line Arguments. One important object made available by this module is the variable sys.argv. This variable has the command line arguments used to run this script. For example, if we had a python script called portfolio.py, and executed it with the following command:

python portfolio.py -xvb display.csv

Then the sys.argv list would be ["portfolio.py", "-xvb", "display.csv"]. Sophisticated argument processing is done with the optparse module.
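
For very simple scripts, the list can be picked apart by hand. The sketch below uses a hypothetical helper, split_arguments(), and treats anything beginning with a hyphen as an option; real option parsing has more rules than this.

```python
import sys

def split_arguments(argv):
    """Separate the script name, option-like arguments and file names."""
    script = argv[0]
    options = [arg for arg in argv[1:] if arg.startswith("-")]
    files = [arg for arg in argv[1:] if not arg.startswith("-")]
    return script, options, files

# The argument list from the example command above.
script, options, files = split_arguments(["portfolio.py", "-xvb", "display.csv"])

# In a real script the call would be split_arguments(sys.argv).
```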

A few other interesting objects in the sys module are the following variables.

sys.version: The version of this interpreter as a string. For example, '2.6.3 (r263:75184, Oct  2 2009, 07:56:03) \n[GCC 4.0.1 (Apple Inc. build 5493)]'
sys.version_info: Version information as a tuple, for example: (2, 6, 3, 'final', 0).
sys.hexversion: Version information encoded as a single integer. Evaluating hex(sys.hexversion) yields '0x20603f0'. Each byte of the value is version information.
sys.copyright: Copyright notice pertaining to this interpreter.
sys.platform: Platform identifier, for example, 'darwin', 'win32' or 'linux2'.
sys.prefix: File Path prefix used to find the Python library, for example '/usr', '/Library/Frameworks/Python.framework/Versions/2.5' or 'c:\Python25'.
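
The version variables are mostly used for feature checks. The following sketch compares sys.version_info against a tuple and unpacks a hexversion value; the decode_hexversion() helper is hypothetical, written only to show how the bytes are laid out.

```python
import sys

# Tuples compare element by element, so version checks read naturally.
at_least_2_6 = sys.version_info >= (2, 6)

def decode_hexversion(hexversion):
    """Split a hexversion integer into its version fields."""
    return (
        (hexversion >> 24) & 0xFF,  # major
        (hexversion >> 16) & 0xFF,  # minor
        (hexversion >> 8) & 0xFF,   # micro
        (hexversion >> 4) & 0xF,    # release level; 0xF means final
        hexversion & 0xF,           # serial
    )

# The example value shown above, 0x20603f0, is version 2.6.3 final.
decoded = decode_hexversion(0x20603f0)
# decoded is (2, 6, 3, 15, 0)
```

Because the fields are packed from most to least significant, two hexversion values can also be compared directly as integers.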

Additional File-Processing Modules

There are several other chapters of the Python Library Reference that cover even more file formats. We’ll identify them briefly here.

Chapter 7 – Internet Data Handling. Reading and processing files of Internet data types is very common. Internet data types have formal definitions governed by the Internet standards, called Requests for Comments (RFCs). The following modules are for handling Internet data structures. These modules and the related standards are beyond the scope of this book.

email
Helps you handle email MIME attachments.
mailcap
Mailcap file handling.
mailbox
Read various mailbox formats.
mhlib
Manipulate MH mailboxes from Python.
mimetools
Tools for parsing MIME-style message bodies.
mimetypes
Mapping of filename extensions to MIME types.
MimeWriter
Generic MIME file writer.
mimify
Mimification and unmimification of mail messages.
multifile
Support for reading files which contain distinct parts, such as some MIME data.
rfc822
Parse RFC 822 style mail headers.
base64
Encode and decode files using the MIME base64 data.
binhex
Encode and decode files in binhex4 format.
binascii
Tools for converting between binary and various ASCII-encoded binary representations.
quopri
Encode and decode files using the MIME quoted-printable encoding.
uu
Encode and decode files in uuencode format.

Chapter 13 – Data Persistence. Many Python programs will also deal with Python objects that are exported from memory to external files or retrieved from files to memory. Since an external file is more persistent than the volatile working memory of a computer, this process makes an object persistent or retrieves a persistent object. One mechanism for creating a persistent object is called serialization, and is supported by several modules, which are beyond the scope of this book.

pickle
Convert Python objects to streams of bytes and back.
cPickle
Faster version of pickle, but not subclassable.
copy_reg
Register pickle support functions.
shelve
Python object persistence.
marshal
Convert Python objects to streams of bytes and back (with different constraints).

More complex file structures can be processed using the standard modules available with Python. The widely-used DBM database manager is available, plus additional modules are available on the web to provide ODBC access or to connect to a platform-specific database access routine. The following Python modules deal with these kinds of files. These modules are beyond the scope of this book.

anydbm
Generic interface to DBM-style database modules.
whichdb
Guess which DBM-style module created a given database.
dbm
The standard database interface, based on the ndbm library.
gdbm
GNU’s reinterpretation of dbm.
dbhash
DBM-style interface to the BSD database library.
bsddb
Interface to the Berkeley DB database library.
dumbdbm
Portable implementation of the simple DBM interface.

Additionally, Python provides a relational database module.

sqlite3
An interface to SQLite, a very pleasant, easy-to-use relational database (RDBMS) that keeps each database in a single file and needs no separate server. This handles a wide variety of SQL statements.

File Module Exercises

  1. Source Lines of Code. One measure of the complexity of an application is the count of the number of lines of source code. Often, this count discards comment lines. We’ll write an application to read Python source files, discarding blank lines and lines beginning with #, and producing a count of source lines.

    We’ll develop a function to process a single file. We’ll use the glob module to locate all of the *.py files in a given directory.

    Develop a function fileLineCount( name ) which opens a file with the given name and examines all of the lines of the file. Each line should have strip() applied to remove leading and trailing spaces. If the resulting line is of length zero, it was effectively blank, and can be skipped. If the resulting line begins with #, the line is entirely a comment, and can be skipped. All remaining lines should be counted, and fileLineCount() returns this count.

    Develop a function directoryLineCount( path ) which uses the path with glob.glob() to expand all matching file names. Each file name is processed with fileLineCount( name ) to get the number of non-comment source lines. Write this to a tab-delimited file; each line should have the form filename\tlines.

    For a sample application, look in your Python distribution for Lib/idlelib/*.py.

  2. Summarize a Tab-Delimited File. The previous exercise produced a file where each line has the form filename\tlines. Read this tab-delimited file, producing a nicer-looking report that has column titles, file and line counts, and a total line count at the end of the report.

  3. File Processing Pipeline. The previous two exercises produced programs which can be part of a processing pipeline. The first exercise should produce its output on sys.stdout. The second exercise should gather its input from sys.stdin. Once this capability is in place, the pipeline can be invoked using a command like the following:

    $ python lineCounter.py | python lineSummary.py
    
