There are a number of operations closely related to file processing. Deleting and renaming files are examples of operations that change the directory information that the operating system maintains to describe a file. Python provides numerous modules for these operating system operations.
We can’t begin to cover all of the various ways in which Python supports file handling. However, we can identify the essential modules that may help you avoid reinventing the wheel. Further, these modules can provide you a view of the Pythonic way of working with data from files.
The following modules have features that are essential for supporting file processing. We’ll cover selected features of each module that are directly relevant to file processing. We’ll present these in the order you’d find them in the Python library documentation.
Chapter 11 – File and Directory Access. Chapter 11 of the Library reference covers many modules which are essential for reliable use of files and directories. We’ll look closely at the following modules.
The os.path module has functions for numerous common pathname manipulations. Use these functions to split and join full directory path names. The module is operating-system neutral, with a correct implementation for each operating system.
One of the most obvious differences among operating systems is the way that files are named. In particular, the path separator can be either the POSIX standard /, or the Windows \. Rather than make each program aware of the operating system rules for path construction, Python provides the os.path module to make all of the common filename manipulations completely consistent.
Chapter 12 – Data Persistence. There are several issues related to making objects persistent. We’ll look at these modules in detail in File Formats: CSV, Tab, XML, Logs and Others.
Chapter 13 – Data Compression and Archiving. This chapter of the Library reference covers data compression. We’ll look closely at the following modules.
Chapter 14 – File Formats. These are modules for reading and writing files in a few of the amazing variety of file formats that are in common use. We’ll focus on just a few.
Chapter 16 - Generic Operating System Services. The following modules contain basic features that are common to all operating systems.
Chapter 28 - Python Runtime Services. The modules described in this chapter of the Library reference include some that are used for handling various kinds of files. We’ll look closely at just one.
The os.path module contains more useful functions for managing path and directory names. A serious mistake is to use ordinary string functions with literal strings for the path separators. A Windows program using \ as the separator won’t work anywhere else. A less serious mistake is to use os.sep instead of the routines in the os.path module.
The os.path module contains the following functions for completely portable path and filename manipulation.
The following example is typical of the manipulations done with os.path.
import sys, os.path
def process( oldName, newName ):
    pass  # Some Processing...
for oldFile in sys.argv[1:]:
    dir, fileext= os.path.split(oldFile)
    file, ext= os.path.splitext( fileext )
    if ext.upper() == '.RST':
        newFile= os.path.join( dir, file ) + '.HTML'
        print oldFile, '->', newFile
        process( oldFile, newFile )
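To see why os.path is preferable to hand-built strings, the following sketch calls the same functions from posixpath and ntpath, the two standard library modules that os.path is bound to on POSIX and Windows systems respectively. The name docs/intro.rst is made up for illustration, and print() is written as a function so the sketch also runs under Python 3.

```python
import ntpath      # the Windows flavor of os.path
import posixpath   # the POSIX flavor of os.path

# Hypothetical directory and file names, used only for illustration.
posix_name = posixpath.join("docs", "intro.rst")
win_name = ntpath.join("docs", "intro.rst")
print(posix_name)   # docs/intro.rst
print(win_name)     # docs\intro.rst

# split() and splitext() take a path apart in the same way on both flavors.
directory, filename = posixpath.split(posix_name)
base, ext = posixpath.splitext(filename)
print(directory)    # docs
print(base)         # intro
print(ext)          # .rst

win_directory, win_filename = ntpath.split(win_name)
print(win_directory)   # docs
print(win_filename)    # intro.rst
```

A program that only ever calls os.path.join() and os.path.split() gets whichever flavor matches the platform it runs on, without containing a separator character of its own.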
The os module contains an interface to many operating system-specific functions to manipulate processes, files, file descriptors, directories and other “low level” features of the OS. Programs that import and use os stand a better chance of being portable between different platforms. Portable programs must depend only on functions that are supported for all platforms (e.g., unlink() and opendir()), and perform all pathname manipulation with os.path.
The os module exports the following variables that characterize your operating system.
os.sep, os.altsep: The (or the most common) pathname separator ('/' or ':' or '\') and the alternate pathname separator (None or '/'). Most of the Python library routines will translate '/' to the correct value for the operating system (typically, '\' on Windows).
It is best to always use os.path rather than these low-level constants.
The os module has a large number of functions. Many of these are not directly related to file manipulation. However, a few are commonly used to create and remove files and directories. Beyond these basic manipulations, the shutil module supports a variety of file copy operations.
Here’s a short example showing some of the functions in the os module.
>>> import os
>>> os.chdir("/Users/slott")
>>> os.getcwd()
'/Users/slott'
>>> os.listdir(os.getcwd())
['.bash_history', '.bash_profile',
'.bash_profile.pysave', '.DS_Store',
'.filezilla', '.fonts.cache-1', '.fop', '.idlerc', '.isql.history',
'.lesshst', '.login_readpw',
'.monitor', '.mozilla', '.sh_history', '.sqlite_history',
'.ssh', '.subversion', '.texlive2008', '.Trash', '.viminfo', '.wxPyDemo',
'.xxe', 'Applications', 'argo.user.properties', 'Burn Folder.fpbf',
'Desktop', 'Documents', 'Downloads', 'Library', 'Movies',
'Music', 'Pictures', 'Public', 'Sites']
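The functions most often used to create and remove files and directories can be sketched as follows. This is a minimal illustration, not a complete tour of the module; the names work, draft.txt and final.txt are hypothetical, and everything happens inside a scratch directory made with tempfile.mkdtemp().

```python
import os
import tempfile

# Work inside a scratch directory so nothing of value is touched.
scratch = tempfile.mkdtemp()
work = os.path.join(scratch, "work")
os.mkdir(work)                        # create a directory

old_name = os.path.join(work, "draft.txt")   # hypothetical file names
new_name = os.path.join(work, "final.txt")
with open(old_name, "w") as target:
    target.write("some data\n")

os.rename(old_name, new_name)         # rename the file
names = os.listdir(work)
print(names)                          # ['final.txt']

os.remove(new_name)                   # delete the file
os.rmdir(work)                        # delete the now-empty directory
os.rmdir(scratch)
still_there = os.path.exists(scratch)
print(still_there)                    # False
```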
The fileinput module interacts with sys.argv. The fileinput.input() function opens files based on all the values of sys.argv[1:]. It carefully skips sys.argv[0], which is the name of the Python script file. For each file, it reads all of the lines as text, allowing a program to read and process multiple files, like many standard Unix utilities.
The typical use case is:
import fileinput
for line in fileinput.input():
    process(line)
This iterates over the lines of all files listed in sys.argv[1:], with a default of sys.stdin if the list is empty. If a filename is - it is also replaced by sys.stdin at that position in the list of files. To specify an alternative list of filenames, pass it as the argument to input(). A single file name is also allowed in addition to a list of file names.
While processing input, several functions are available in the fileinput module:
All files are opened in text mode. If an I/O error occurs during opening or reading a file, the IOError exception is raised.
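The status functions can be tried out with a small sketch like the one below. It assumes two throwaway files, a.txt and b.txt (hypothetical names), and passes them to fileinput.input() explicitly instead of relying on sys.argv.

```python
import fileinput
import os
import tempfile

# Build two small input files (hypothetical names) in a scratch directory.
scratch = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
    path = os.path.join(scratch, name)
    with open(path, "w") as target:
        target.write(text)
    paths.append(path)

report = []
for line in fileinput.input(paths):
    report.append((
        os.path.basename(fileinput.filename()),  # file being read
        fileinput.lineno(),                       # cumulative line number
        fileinput.filelineno(),                   # line number within this file
        fileinput.isfirstline(),                  # True on a file's first line
        line.rstrip("\n"),
    ))
fileinput.close()

for row in report:
    print(row)

# Clean up the scratch files.
for path in paths:
    os.remove(path)
os.rmdir(scratch)
```

Note how lineno() keeps counting across files while filelineno() starts over at 1 for each new file.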
Example. We can use fileinput to write a Python version of the common Unix utility, grep. The grep utility searches a list of files for a given pattern.
For non-Unix users, the grep utility looks for the given regular expression in any number of files. The name grep is an acronym of Global Regular Expression Print. This is similar to the Windows command find.
greppy.py
#!/usr/bin/env python
import sys
import re
import fileinput
pattern= re.compile( sys.argv[1] )
for line in fileinput.input(sys.argv[2:]):
    if pattern.match( line ):
        # Trailing comma: line already ends with a newline of its own.
        print fileinput.filename(), fileinput.filelineno(), line,
The sys module provides access to the command-line parameters. The re module provides the pattern matching. The fileinput module makes searching an arbitrary list of files simple. For more information on re, see Complex Strings: the re Module.
The first command line argument ( sys.argv[0] ) is the name of the script, which this program ignores.
The second command-line argument is the pattern that defines the target of the search.
The remaining command-line arguments are given to fileinput.input() so that all the named files will be examined.
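The pattern regular expression is matched against each individual input line. Note that match() only succeeds when the pattern matches at the beginning of the line; to find the pattern anywhere in the line, as grep does, use search() instead.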
If match() returns None, the line did not match. If match() returns an object, the program prints the current file name, the current line number of the file and the actual input line that matched.
After we do a chmod +x greppy.py, we can use this program as follows. Note that we have to provide quotes to prevent the shell from doing globbing on our pattern string.
$ greppy.py 'import.*random' *.py
demorandom.py 2 import random
dice.py 1 import random
functions.py 2 import random
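The difference between the match() and search() methods of a compiled pattern matters for a grep-like tool. The following sketch applies both to three made-up input lines; it illustrates the re module's behavior and is not a change to greppy.py.

```python
import re

pattern = re.compile(r"import.*random")

# Made-up input lines for illustration.
lines = [
    "import random\n",
    "    import random\n",      # indented, as inside a function body
    "from random import *\n",
]

# match() is anchored at position 0; search() scans the whole line.
results = [
    (pattern.match(line) is not None, pattern.search(line) is not None)
    for line in lines
]

for line, (at_start, anywhere) in zip(lines, results):
    print("%r match=%s search=%s" % (line, at_start, anywhere))
```

Only the first line satisfies match(); the indented line is found by search() alone, and the third line contains the two words in the wrong order for this pattern.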
The glob module gives Windows programmers a necessary Unix shell capability: expanding wild-card patterns into lists of file names. The glob module includes the glob.glob() function for this.
A common use for glob is something like the following under Windows.
import glob, sys
for arg in sys.argv[1:]:
    for f in glob.glob(arg):
        process( f )
This makes Windows programs process command line arguments somewhat like Unix programs. Each argument is passed to glob.glob() to expand any patterns into a list of matching files. If the argument is not a wild-card pattern, glob simply returns a list containing this one file name, provided that the file exists.
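A minimal sketch of glob.glob() at work, assuming a scratch directory that holds three empty files with hypothetical names:

```python
import glob
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
for name in ["dice.py", "functions.py", "README"]:
    open(os.path.join(scratch, name), "w").close()

# A wild-card pattern expands to every matching file name.
pattern = os.path.join(scratch, "*.py")
matches = sorted(os.path.basename(p) for p in glob.glob(pattern))
print(matches)    # ['dice.py', 'functions.py']

# A name without wild-card characters yields itself when the file exists...
plain = [os.path.basename(p) for p in glob.glob(os.path.join(scratch, "README"))]
print(plain)      # ['README']

# ...and an empty list when it does not.
missing = glob.glob(os.path.join(scratch, "no-such-file"))
print(missing)    # []

shutil.rmtree(scratch)
```

The results are sorted here because glob.glob() makes no promise about the order of the names it returns.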
The fnmatch module has the algorithm for actually matching a wild-card pattern against specific file names. This module implements the Unix shell wild-card rules. These are not the same as the more sophisticated regular expression rules. The module contains the following function:
The patterns use * to match any number of characters, ? to match any single character. [letters] matches any of these letters, and [!letters] matches any letter that is not in the given set of letters.
>>> import fnmatch
>>> fnmatch.fnmatch('greppy.py','*.py')
True
>>> fnmatch.fnmatch('README','*.py')
False
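The other wild-card forms can be exercised the same way. This sketch uses fnmatch.fnmatchcase(), which always compares case-sensitively, so the results do not depend on the operating system; the file names are invented for the example.

```python
import fnmatch

# Invented file names for illustration.
names = ["greppy.py", "dice.py", "data1.csv", "data2.csv", "dataX.csv", "README"]

python_files = [n for n in names if fnmatch.fnmatchcase(n, "*.py")]
one_char = [n for n in names if fnmatch.fnmatchcase(n, "data?.csv")]
digits = [n for n in names if fnmatch.fnmatchcase(n, "data[0-9].csv")]
non_digits = [n for n in names if fnmatch.fnmatchcase(n, "data[!0-9].csv")]

print(python_files)   # ['greppy.py', 'dice.py']
print(one_char)       # ['data1.csv', 'data2.csv', 'dataX.csv']
print(digits)         # ['data1.csv', 'data2.csv']
print(non_digits)     # ['dataX.csv']
```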
One common problem is to open a unique temporary file to hold intermediate results. The tempfile module provides us a handy way to create temporary files that we can write to and read from.
The tempfile module creates a temporary file in the most secure and reliable manner. The tempfile module includes an internal function, mkstemp(), which does the hard work of creating a unique temporary file.
The tempfile.TemporaryFile() function creates a file which is automatically deleted when it is closed. All of the parameters are optional. By default, the mode is 'w+b': write with read in binary mode.
The bufsize parameter has the same meaning as it does for the built-in open() function.
The keyword parameters suffix, prefix and dir provide some structure to the name assigned to the file. The suffix should include the dot, for example suffix='.tmp'.
The tempfile.NamedTemporaryFile() function is similar to TemporaryFile(); it creates a file which can be automatically deleted when it is closed. The temporary file, however, is guaranteed to be visible on the file system while the file is open.
The parameters are the same as a tempfile.TemporaryFile().
If the delete parameter is False, the file is not automatically deleted.
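import tempfile, os
fd, tempName = tempfile.mkstemp( '.d1' )
# mkstemp() has already opened the file; wrap its descriptor rather than opening the name again.
temp= os.fdopen( fd, 'w+' )
# Some Processing...
temp.close()
os.remove( tempName )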
This fragment will create a unique temporary file name with an extension of .d1. Since the name is guaranteed to be unique, this can be used without fear of damaging or overwriting any other file.
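The two higher-level functions can be sketched as follows. This is an illustration under the assumption of a POSIX-like system and Python 3 byte strings; the suffix .tmp and the file contents are arbitrary.

```python
import os
import tempfile

# TemporaryFile: opened in 'w+b' mode by default, removed when closed.
with tempfile.TemporaryFile() as scratch:
    scratch.write(b"intermediate results\n")
    scratch.seek(0)                 # rewind before reading back
    data = scratch.read()
print(data)

# NamedTemporaryFile: has a visible name; delete=False keeps it after close.
named = tempfile.NamedTemporaryFile(suffix=".tmp", delete=False)
named.write(b"kept after close\n")
named.close()
kept = os.path.exists(named.name)
print(kept)                         # True

os.remove(named.name)               # removing the file is now our job
gone = not os.path.exists(named.name)
print(gone)                         # True
```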
The shutil module helps you automate copying files and directories. This saves the steps of opening, reading, writing and closing files when there is no actual processing, simply moving files.
Note that removing a file is done with os.remove() (or os.unlink()).
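A short sketch of the most common shutil operations follows. The directory layout and file names are hypothetical, and everything takes place under a scratch directory that is removed at the end.

```python
import os
import shutil
import tempfile

scratch = tempfile.mkdtemp()
source_dir = os.path.join(scratch, "source")      # hypothetical layout
os.mkdir(source_dir)
original = os.path.join(source_dir, "notes.txt")
with open(original, "w") as target:
    target.write("copy me\n")

# Copy one file, then a whole directory tree, then move (rename) a file.
duplicate = os.path.join(source_dir, "notes.bak")
shutil.copy(original, duplicate)
backup_dir = os.path.join(scratch, "backup")
shutil.copytree(source_dir, backup_dir)
moved = os.path.join(scratch, "notes.old")
shutil.move(duplicate, moved)

source_names = sorted(os.listdir(source_dir))
backup_names = sorted(os.listdir(backup_dir))
print(source_names)   # ['notes.txt']
print(backup_names)   # ['notes.bak', 'notes.txt']

shutil.rmtree(scratch)   # remove the directory tree and everything in it
```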
This module allows us to build Python applications that are like shell scripts. There are a lot of advantages to writing Python programs rather than shell scripts to automate mundane tasks.
First, Python programs are easier to read than shell scripts. This is because the language did not evolve in a way that emphasized terseness; the shell script languages use a minimum of punctuation, which makes them hard to read.
Second, Python programs have a more sophisticated programming model, with class definitions, and numerous sophisticated data structures. The shell works with simple argument lists; it has to resort to running the test or expr programs to process strings or numbers.
Finally, Python programs have direct access to more of the operating system’s features than the shell. Generally, many of the basic GNU/Linux API calls are provided via innumerable small programs. Rather than having the shell run a small program to make an API call, Python can simply make the API call.
An archive file contains a complex, hierarchical file directory in a single sequential file. The archive file includes the original directory information as well as the contents of all of the files in those directories. There are a number of archive file formats; Python directly supports two: tar and zip archives.
The tar (Tape Archive) format is widely used in the GNU/Linux world to distribute files. It is a POSIX standard, making it usable on a wide variety of operating systems. A tar file can also be compressed, often with the GZip utility, leading to .tgz or .tar.gz files which are compressed archives.
The Zip file format was invented by Phil Katz at PKWare as a way to archive a complex, hierarchical file directory into a compact sequential file. The Zip format is widely used but is not a POSIX standard. Zip file processing includes a choice of compression algorithms; the exact algorithm used is encoded in the header of the file, not in the name of the file.
Creating a TarFile or a ZipFile. Since an archive file is still – essentially – a single file, it is opened with a variation on the open() function. Since an archive file contains directory and file contents, it has a number of methods above and beyond what a simple file has.
The tarfile.open() module-level function opens the given tar file for processing. The name is a file name string; it is optional because the fileobj can be used instead. The fileobj is a conventional file object, which can be used instead of the name; it can be a standard file like sys.stdin. The buffer size is like that of the built-in open() function.
The mode is similar to the built-in open() function; it has numerous additional characters to specify the compression algorithms, if any.
The zipfile.ZipFile class constructor opens the given zip file for processing. The name is a file name string. The mode is similar to the built-in open() function. The compression is the compression code. It can be zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED. A compression of ZIP_STORED uses no compression; a value of ZIP_DEFLATED uses the Zlib compression algorithms.
The allowZip64 option is used when creating new, empty Zip Files. If this is set to True, then this will create files with the ZIP64 extensions. If this is left at False, an exception is raised any time a ZIP64 extension would be required.
The open function can be used to read or write the archive file. It can be used to process a simple disk file, using the filename. Or, more importantly, it can be used to process a non-disk file: this includes tape devices and network sockets. In the non-disk case, a file object is given to tarfile.open().
For tar files, the mode information is rather complex because we can do more than simply read, write and append. The mode string addresses three issues: the kind of opening (reading, writing, appending), the kind of access (block or stream) and the kind of compression.
For zip files, however, the mode is simply the kind of opening that is done.
Opening - Both zip and tar files. A zip or tar file can be opened in any of three modes.
Access - tar files only. A tar file can have either of two fundamentally different kinds of access. If a tar file is a disk file, which supports seek and tell operations, then we access the tar file in block mode. If the tar file is a stream, network connection or a pipeline, which does not support seek or tell operations, then we must access the archive in stream mode.
This access distinction isn’t meaningful for zip files.
Compression - tar files only. A tar file may be compressed with GZip or BZip2 algorithms, or it may be uncompressed. Generally, you only need to select compression when writing. It doesn’t make sense to attempt to select compression when appending to an existing file, or when reading a file.
This compression distinction isn’t meaningful for zip files. Zip file compression is specified in the zipfile.ZipFile constructor.
Tar File Examples. The most common block modes for tar files are r, a, w:, w:gz, w:bz2. Note that read and append modes cannot meaningfully provide compression information, since it’s obvious from the file if it was compressed, and which algorithm was used.
For stream modes, however, the compression information must be provided. The modes include all six combinations: r|, r|gz, r|bz2, w|, w|gz, w|bz2.
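The mode strings can be tried out with a sketch like the following, which writes a small compressed archive in block mode and then reads it back in both block and stream mode. The member name README and the archive name demo.tar.gz are hypothetical.

```python
import os
import tarfile
import tempfile

scratch = tempfile.mkdtemp()
member_path = os.path.join(scratch, "README")      # hypothetical member
with open(member_path, "w") as target:
    target.write("read me first\n")
archive_path = os.path.join(scratch, "demo.tar.gz")

# Block mode, writing with GZip compression.
archive = tarfile.open(archive_path, "w:gz")
archive.add(member_path, arcname="README")
archive.close()

# Block mode, reading: plain "r" detects the compression on its own.
archive = tarfile.open(archive_path, "r")
block_names = archive.getnames()
archive.close()

# Stream mode, reading: the data arrives through a file object and the
# compression is named explicitly with "r|gz".
with open(archive_path, "rb") as source:
    archive = tarfile.open(fileobj=source, mode="r|gz")
    stream_names = [member.name for member in archive]
    archive.close()

print(block_names)    # ['README']
print(stream_names)   # ['README']

os.remove(member_path)
os.remove(archive_path)
os.rmdir(scratch)
```

In stream mode the archive can only be read from front to back, which is why the member names are collected by iterating over the archive once.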
Directory Information. Each individual file in a tar archive is described with a TarInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive.
Each individual file in a zip archive is described with a ZipInfo object. This has name, size, access mode, ownership and other OS information on the file. A number of methods will retrieve member information from an archive.
Extracting Files From an Archive. If a tar archive is opened with r, then you can read the archive and extract files from it. The following methods will extract member files from an archive.
If a zip archive is opened with r, then you can read the archive and extract the contents of a file from it.
Creating or Extending an Archive. If a tar archive is opened with w or a, then you can add files to it. The following methods will add member files to an archive.
If a zip archive is opened with w or a, then you can add files to it.
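For zip archives, the three opening modes and the directory information can be sketched as below. The archive name demo.zip and the member names are made up, and writestr() is used so that no separate input files are needed; the byte string result of read() assumes Python 3.

```python
import os
import tempfile
import zipfile

scratch = tempfile.mkdtemp()
zip_path = os.path.join(scratch, "demo.zip")   # hypothetical archive name

# "w" creates a new archive; ZIP_DEFLATED asks for zlib compression.
archive = zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED)
archive.writestr("chapter1.xml", "<chapter>one</chapter>")
archive.close()

# "a" extends the existing archive with another member.
archive = zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED)
archive.writestr("chapter2.xml", "<chapter>two</chapter>")
archive.close()

# "r" reads the directory information and the member contents.
archive = zipfile.ZipFile(zip_path, "r")
names = archive.namelist()
sizes = [(info.filename, info.file_size) for info in archive.infolist()]
content = archive.read("chapter2.xml")
archive.close()

print(names)     # ['chapter1.xml', 'chapter2.xml']
print(sizes)     # [('chapter1.xml', 22), ('chapter2.xml', 22)]
print(content)

os.remove(zip_path)
os.rmdir(scratch)
```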
A tarfile Example. Here’s an example of a program to examine a tarfile, looking for documentation like .html files or README files. It will provide a list of .html files, and actually show the contents of the README files.
readtar.py
#!/usr/bin/env python
"""Scan a tarfile looking for *.html and a README."""
import tarfile
import fnmatch
archive= tarfile.open( "SQLAlchemy-0.3.5.tar.gz", "r" )
for mem in archive.getmembers():
    if fnmatch.fnmatch( mem.name, "*.html" ):
        print mem.name
    elif fnmatch.fnmatch( mem.name.upper(), "*README*" ):
        print mem.name
        docFile= archive.extractfile( mem )
        print docFile.read()
A zipfile Example. Here’s an example of a program to create a zipfile based on the .xml files in a particular directory.
writezip.py
import zipfile, os, fnmatch
bookDistro= zipfile.ZipFile( 'book.zip', 'w', zipfile.ZIP_DEFLATED )
for nm in os.listdir('..'):
    if fnmatch.fnmatch(nm,'*.xml'):
        full= os.path.join( '..', nm )
        bookDistro.write( full )
bookDistro.close()
The sys module provides access to some objects used or maintained by the interpreter and to functions that interact strongly with the interpreter.
The sys module also provides the three standard files used by Python.
| Variable | Description |
|---|---|
| sys.stdin | Standard input file object; used by raw_input() and input(). Also available via sys.stdin.read() and related methods of the file object. |
| sys.stdout | Standard output file object; used by the print statement. Also available via sys.stdout.write() and related methods of the file object. |
| sys.stderr | Standard error object; used for error messages, typically unhandled exceptions. Available via sys.stderr.write() and related methods of the file object. |
A program can assign another file object to one of these global variables. When you change the file for these globals, this will redirect all of the interpreter’s I/O.
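Redirection of this kind can be sketched in a few lines. The example below assumes Python 3, where io.StringIO accepts the text that print() writes; it captures one line of output in memory and then restores the original file object.

```python
import io
import sys

capture = io.StringIO()
saved = sys.stdout          # remember the real standard output
sys.stdout = capture        # from here on, print writes into capture
try:
    print("this line is captured")
finally:
    sys.stdout = saved      # always restore the original file object

text = capture.getvalue()
print(repr(text))
```

The try/finally pair is a design choice worth keeping: if the redirected code raises an exception, later output would otherwise disappear into the capture object.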
Command-Line Arguments. One important object made available by this module is the variable sys.argv. This variable has the command line arguments used to run this script. For example, if we had a python script called portfolio.py, and executed it with the following command:
python portfolio.py -xvb display.csv
Then the sys.argv list would be ["portfolio.py", "-xvb", "display.csv"]. Sophisticated argument processing is done with the optparse module.
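As a rough illustration of what optparse does with such a list, the sketch below parses the same arguments. The text does not say what -x, -v and -b mean, so they are treated here as simple on/off flags purely for the sake of the example.

```python
import optparse

parser = optparse.OptionParser(usage="%prog [-x] [-v] [-b] file...")
# The meanings of -x, -v and -b are assumptions: plain boolean flags.
parser.add_option("-x", action="store_true", dest="x", default=False)
parser.add_option("-v", action="store_true", dest="v", default=False)
parser.add_option("-b", action="store_true", dest="b", default=False)

# Equivalent to sys.argv[1:] for: python portfolio.py -xvb display.csv
arguments = ["-xvb", "display.csv"]
options, files = parser.parse_args(arguments)

flags = (options.x, options.v, options.b)
print(flags)   # (True, True, True)
print(files)   # ['display.csv']
```

In a real script, calling parser.parse_args() with no argument makes it read sys.argv[1:] itself.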
A few other interesting objects in the sys module are the following variables.
| Variable | Description |
|---|---|
| sys.version | The version of this interpreter as a string. For example, '2.6.3 (r263:75184, Oct 2 2009, 07:56:03) \n[GCC 4.0.1 (Apple Inc. build 5493)]' |
| sys.version_info | Version information as a tuple, for example: (2, 6, 3, 'final', 0). |
| sys.hexversion | Version information encoded as a single integer. Evaluating hex(sys.hexversion) yields '0x20603f0'. Each byte of the value is version information. |
| sys.copyright | Copyright notice pertaining to this interpreter. |
| sys.platform | Platform identifier, for example, 'darwin', 'win32' or 'linux2'. |
| sys.prefix | File Path prefix used to find the Python library, for example '/usr', '/Library/Frameworks/Python.framework/Versions/2.5' or 'c:\Python25'. |
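The relationship between sys.version_info and sys.hexversion can be checked with a short sketch. It unpacks the first three bytes of the integer and compares them with the tuple; the minimum version (2, 6) used at the end is an arbitrary example.

```python
import sys

major, minor, micro = sys.version_info[:3]

# sys.hexversion packs the same numbers one byte at a time:
# major, minor, micro, then release level and serial in the last byte.
hex_major = (sys.hexversion >> 24) & 0xFF
hex_minor = (sys.hexversion >> 16) & 0xFF
hex_micro = (sys.hexversion >> 8) & 0xFF

print(hex(sys.hexversion))
print((major, minor, micro) == (hex_major, hex_minor, hex_micro))   # True

# Tuple comparison is a convenient way to test for a minimum version.
recent_enough = sys.version_info >= (2, 6)
print(recent_enough)
```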
There are several other chapters of the Python Library Reference that cover even more file formats. We’ll identify them briefly here.
Chapter 7 – Internet Data Handling. Reading and processing files of Internet data types is very common. Internet data types have formal definitions governed by the internet standards, called Requests for Comment (RFCs). The following modules are for handling Internet data structures. These modules and the related standards are beyond the scope of this book.
Chapter 13 - Data Persistence. Many Python programs will also deal with Python objects that are exported from memory to external files or retrieved from files to memory. Since an external file is more persistent than the volatile working memory of a computer, this process makes an object persistent or retrieves a persistent object. One mechanism for creating a persistent object is called serialization, and is supported by several modules, which are beyond the scope of this book.
More complex file structures can be processed using the standard modules available with Python. The widely-used DBM database manager is available, plus additional modules are available on the web to provide ODBC access or to connect to a platform-specific database access routine. The following Python modules deal with these kinds of files. These modules are beyond the scope of this book.
Additionally, Python provides a relational database module.
Source Lines of Code. One measure of the complexity of an application is the count of the number of lines of source code. Often, this count discards comment lines. We’ll write an application to read Python source files, discarding blank lines and lines beginning with #, and producing a count of source lines.
We’ll develop a function to process a single file. We’ll use the glob module to locate all of the *.py files in a given directory.
Develop a function fileLineCount( name ) which opens a file with the given name and examines all of the lines of the file. Each line should have strip() applied to remove leading and trailing spaces. If the resulting line is of length zero, it was effectively blank, and can be skipped. If the resulting line begins with # the line is entirely a comment, and can be skipped. All remaining lines should be counted, and fileLineCount() returns this count.
Develop a function directoryLineCount( path ) which uses the path with glob.glob() to expand all matching file names. Each file name is processed with fileLineCount( name ) to get the number of non-comment source lines. Write this to a tab-delimited file; each line should have the form filename\tlines.
For a sample application, look in your Python distribution for Lib/idlelib/*.py.
Summarize a Tab-Delimited File. The previous exercise produced a file where each line has the form filename\tlines. Read this tab-delimited file, producing a nicer-looking report that has column titles, file and line counts, and a total line count at the end of the report.
File Processing Pipeline. The previous two exercises produced programs which can be part of a processing pipeline. The first exercise should produce its output on sys.stdout. The second exercise should gather its input from sys.stdin. Once this capability is in place, the pipeline can be invoked using a command like the following:
$ python lineCounter.py | python lineSummary.py