When we first look at the problem we’re trying to solve, it’s often difficult to see how we apply Python. This chapter is really about the question “Now that I know the language, how do I get started on my real problem?”
The answer is – almost always – “What information do you have and what processing do you want to do?” This chapter will help you apply the file abstraction to your problem.
Sources and Sinks. When we look at the information we have, it can flow in one of several directions.
All the processing we want to do will involve files in one way or another. Files are either input or output or both. We’ll focus on disk files because they’re the easiest to work with as a beginner.
We’ll talk about how data is organized on files in File Organization and Structure.
There are a number of library modules that are relevant to file processing.
We can then look at some common variations on file processing in Files are the Plumbing of a Software Architecture.
In order to successfully read data from a file, the bytes, characters and lines must have some kind of organization.
We’ve already seen one file organization up close: the CSV format. We looked at this in The csv Module.
From the Ground Up. At the foundation, a file is a sequence of bytes. Any other interpretation of those bytes is the responsibility of the application program. In our case, the program is Python; it includes numerous library modules to handle various kinds of encodings and data organization.
The bytes in a file can have a variety of interpretations. It might be images or financial records or GPS coordinates. Anything is possible. Anything.
The files that are easiest to work with will contain text. By text we mean characters in some well-defined encoding. A very popular encoding is US-ASCII. Other encodings include UTF-8 and UTF-16. The encoding is required to properly interpret the bytes as characters.
A file of text can have higher-level structures. The most basic text file formats interpret the characters as a sequence of variable length lines. Each line is terminated with a newline character. The newline character is coded as '\n' in Python.
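As a small illustration of these layers, the following sketch (written for Python 3, with a made-up sample string) starts from bytes, decodes them as UTF-8 characters, and then splits the characters into lines. The choice of UTF-8 and the sample text are assumptions for the example only.

```python
# Bytes are the foundation; the encoding tells us how to read them as characters.
# The sample text is invented; \u00ef and \u00e9 are two non-ASCII letters.
raw = "na\u00efve caf\u00e9\nsecond line\n".encode("utf-8")  # bytes, as stored in a file

text = raw.decode("utf-8")   # bytes -> characters
lines = text.splitlines()    # characters -> lines (newline characters removed)

print(len(raw), "bytes")
print(len(text), "characters")
print(lines)
```

The byte count is larger than the character count because each of the two non-ASCII letters needs two bytes in UTF-8.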
We could interpret a file of characters as an XML or HTML document. In these cases, the XML or HTML rules tell us how to interpret the characters as tags, coded data, elements, attributes, processing instructions and other elements of these languages.
A file of characters could be understood as a Python program. Usually, we emphasize this by making sure the file name ends in .py.
A .csv file is a sequence of lines. Each line has one or more fields, optionally wrapped in double quotes ("), and separated by commas (,).
Hiding the Details. We use the notion of “layers” to understand the structure of files. We build up from the foundation layer (bytes) to ASCII or Unicode characters, and then to yet higher layers built on this foundation. On top of the character foundation, our choices fan out in many, many directions. We’ll stick to the most common file types: HTML, XML, Python and CSV.
This idea of layers of meaning – from bytes to characters to lines to meaningful records – is yet another application of the abstraction principle. Python allows us to imagine that a file consists of lines of characters. It provides us an abstraction that conceals the details of all those bytes and how they are encoded. We can do the same thing in our software by writing a function that reads a line and transforms it into a meaningful tuple or object that we can process.
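Here is a minimal sketch of that last idea, written for Python 3. The record layout (symbol, shares, price), the Position name and the sample data are all hypothetical; they only show how one function can hide the line-level details from the rest of a program.

```python
import io
from collections import namedtuple

# Hypothetical record layout, chosen only for this illustration.
Position = namedtuple("Position", ["symbol", "shares", "price"])

def parse_line(line):
    """Transform one line of text into a meaningful Position tuple."""
    symbol, shares, price = line.rstrip("\n").split(",")
    return Position(symbol, int(shares), float(price))

def read_positions(source):
    """Yield a Position for each non-blank line of an open file."""
    for line in source:
        if line.strip():
            yield parse_line(line)

# io.StringIO stands in for a disk file so the example is self-contained.
sample = io.StringIO("ABC,10,12.5\nXYZ,5,40.0\n")
positions = list(read_positions(sample))
print(positions)
```

The rest of the program can work with Position objects and never needs to know that the data arrived as comma-separated characters.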
Non-Character Files. There are many common file formats which do not have obvious character encodings. Image files contain encodings of the picture elements, pixels. A photo of your family on vacation in Stockholm might be a 4.6 megapixel image. This image has about 4.6 million individual dots, each of which can be any of 16 million different colors. At three bytes per dot, a raw image file could be 14 million individual bytes of data.
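The arithmetic behind that estimate can be checked with a few lines of Python 3. This is only a rough sketch: it assumes every pixel is stored uncompressed using just enough whole bytes to distinguish 16 million colors.

```python
import math

pixels = 4_600_000    # a 4.6 megapixel image
colors = 16_000_000   # distinct colors each pixel can take

# Bits needed to tell that many colors apart, rounded up to whole bits.
bits_per_pixel = math.ceil(math.log2(colors))
bytes_per_pixel = bits_per_pixel // 8

raw_bytes = pixels * bytes_per_pixel
print(bits_per_pixel, "bits per pixel")
print(raw_bytes, "bytes for the uncompressed image")
```

The result, 13.8 million bytes, rounds to the 14 million quoted above.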
When I look at my computer, I see that the .jpeg files are much smaller. It turns out that these files are compressed. Some clever experts defined ways to reduce the number of bytes required to capture most of the image with enough accuracy that some of us barely notice the difference between a JPEG image and a RAW image.
An audio file might have 48,000 samples per second; spread over roughly three minutes, this leads to about 8.2 million individual samples. Each sample could be one of 4000 amplitude levels, leading us to 12.4 million individual bytes of data. An MP3 file uses a clever algorithm to compress all of these bytes down to 3.5 million bytes that sound pretty much the same as the original AIFF audio file.
Database Files. A database is one or more very highly organized files. A database may contain text, audio and images. That means that the content of a database may contain bytes that must be interpreted as characters, bytes that can be interpreted as MP3-encoded audio, and bytes that can be understood as JPEG-encoded images.
The top-most layer is the “meaning” of all that data and all those bytes. In this case, the database may be a summary of decades of custom quilt-making, with pictures, stories, and descriptions of dozens of quilts.
Tip
Debugging File Formats
When we talk about how data appears in files, we are talking about “data representation.” This is a difficult and sometimes subtle design decision. A common question is “How do I know what the data is?” There are two important points of view.
The “File and Directory Access” section of the Python Library has a number of very useful modules for working with files and directories.
The os.path module contains operating-system agnostic functions for managing path and directory names. Since these functions are tailored for each operating system, this is the best way to assure portability of your program.
The os.path module helps us parse and create correct file names. This module addresses the most obvious differences among operating systems: the way that files are named. In particular, the path separator can be either the POSIX standard /, or the Windows \. Additionally, there’s a MacOS Classic mode that can also use :. Rather than make each program aware of the operating system rules for path construction, Python provides the os.path module to make all of the common filename manipulations completely consistent.
A serious mistake is to use ordinary string functions with literals for the path separators. For example, a program using \ as the separator will only work on Windows, and won’t work anywhere else. A less serious mistake is to build paths by hand with os.sep. The best approach is to use the functions in the os.path module.
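The difference can be seen in a short Python 3 sketch. The directory and file names here are invented for the illustration.

```python
import os
import os.path

parts = ["notes", "drafts", "chapter.txt"]  # hypothetical path components

# Fragile: a literal separator only makes sense on one operating system.
fragile = "\\".join(parts)

# Portable: os.path.join picks the right separator for the current platform.
portable = os.path.join(*parts)

print(fragile)
print(portable)
print(os.path.split(portable))
```

On Windows the two strings happen to agree; on GNU/Linux or Mac OS only the second one names a file inside nested directories.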
The os.path module contains the following functions for completely portable path and filename manipulation.
os.path.basename(path)
Return the base filename, the second half of the result created by os.path.split( path ).
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.basename( fn )
'portfolio.py'
os.path.dirname(path)
Return the directory name, the first half of the result created by os.path.split( path ).
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.dirname(fn)
'/Users/slott/Documents/Writing/NonProg2.5/notes'
os.path.exists(path)
Return True if the pathname refers to an existing file or directory.
os.path.getatime(path)
Return the last access time of a file, reported by os.stat(). See the time module for functions to process the time value.
>>> import os
>>> import time
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.getatime( fn )
1246637163.0
>>> time.ctime(_)
'Fri Jul 3 12:06:03 2009'
os.path.getmtime(path)
Return the last modification time of a file, reported by os.stat(). See the time module for functions to process the time value.
os.path.getsize(path)
Return the size of a file, in bytes, reported by os.stat().
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.getsize( fn )
175L
os.path.isdir(path)
Return True if the pathname refers to an existing directory.
os.path.isfile(path)
Return True if the pathname refers to an existing regular file.
os.path.join(path, ...)
Join path components using the appropriate path separator. This is the best way to assemble long path names from component pieces. It is operating-system independent, and understands all of the operating system’s punctuation rules.
>>> import os
>>> os.path.join( '/Users', 'slott', 'Documents', 'Writing' )
'/Users/slott/Documents/Writing'
os.path.split(path)
Split a pathname into two parts: the directory and the basename (the filename, without path separators, in that directory). The result (s, t) is such that os.path.join( s, t ) yields the original path.
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.split( fn )
('/Users/slott/Documents/Writing/NonProg2.5/notes', 'portfolio.py')
os.path.splitdrive(path)
Split a pathname into a drive specification and the rest of the path. Useful on DOS/Windows/NT. Useless for Linux or Mac OS.
os.path.splitext(path)
Split a path into root and extension. The extension is everything starting at the last dot in the last component of the pathname; the root is everything before that. The result tuple ( root , ext ) is such that root + ext yields the original path.
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> dir, file = os.path.split(fn)
>>> os.path.splitext( file )
('portfolio', '.py')
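Several of the functions above answer questions about files that actually exist on disk. The following Python 3 sketch creates a throw-away file in a temporary directory (the file name is invented) so the checks can be run anywhere.

```python
import os
import os.path
import tempfile
import time

with tempfile.TemporaryDirectory() as folder:
    name = os.path.join(folder, "sample.txt")   # hypothetical file name
    # newline="" keeps the newline as a single byte on every platform.
    with open(name, "w", newline="") as target:
        target.write("hello\n")

    checks = {
        "exists": os.path.exists(name),
        "isfile": os.path.isfile(name),
        "isdir": os.path.isdir(name),
        "folder_isdir": os.path.isdir(folder),
        "missing": os.path.exists(os.path.join(folder, "nope.txt")),
    }
    size = os.path.getsize(name)
    modified = time.ctime(os.path.getmtime(name))

print(checks)
print(size, "bytes, last modified", modified)
```

The temporary directory and everything in it are removed when the with statement finishes.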
The following example is typical of the manipulations done with os.path:
from __future__ import print_function
import sys, os.path

def process( oldName, newName ):
    pass  # Some Processing...

for oldFile in sys.argv[1:]:
    dir, fileext= os.path.split(oldFile)
    file, ext= os.path.splitext( fileext )
    if ext.upper() == '.HTML':
        newFile= os.path.join( dir, file ) + '.BAK'
        print(oldFile, newFile)
        process( oldFile, newFile )
This program imports the sys and os.path modules. The variable oldFile is set to each file name that is listed in the sequence sys.argv by the for statement.
Each file name is split into the path name and the base name. The base name is further split to separate the file name from the extension. The os.path module does this correctly for all operating systems, saving us having to write platform-specific code. For example, splitext() correctly handles the situation where a Linux file has multiple .s in the file name.
The extension is tested to be .HTML. The processing only applies to these files. A new file name is joined from the path, base name and a new extension (.BAK). The old and new file names are printed and some processing, defined in the process() function, uses the oldFile and newFile names.
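One way to experiment with this logic without touching the command line is to wrap the name manipulation in a function. The sketch below is written for Python 3; the function name backup_name and the sample file names are our own inventions, not part of the program above.

```python
import os.path

def backup_name(old_file):
    """Return the .BAK name for an .html file, or None for any other file."""
    folder, file_ext = os.path.split(old_file)
    stem, ext = os.path.splitext(file_ext)
    if ext.upper() != ".HTML":
        return None
    return os.path.join(folder, stem) + ".BAK"

# Hypothetical names, including one with more than one dot.
for sample in (os.path.join("site", "index.html"), "report.v2.html", "notes.txt"):
    print(sample, "->", backup_name(sample))
```

Because only the last dot starts the extension, the name with two dots keeps its first dot in the backup name.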
Path Processing
Programmers are faced with a dilemma between writing a “simple” hack to strip paths or extensions from file names and using the os.path module.
Some programmers argue that the os.path module is too much overhead for such a simple problem as removing the .html from a file name.
Other programmers recognize that most hacks are a false economy: in the long run they do not save time, but rather lead to costly maintenance when the program is expanded or modified.
The shutil module automates copying entire files or directories. This saves the steps of opening, reading, writing and closing files when there is no actual processing, simply moving files.
When we have complex programs that need to preserve a backup copy of a file or rename a file, we have two choices for our design.
shutil.copy(source, destination)
Copy data and mode bits, basically the GNU/Linux command cp source destination. If destination is a directory, a file with the same base name as source is created. If destination is a full file name, this is the destination file.
shutil.copyfile(source, destination)
Copy data from source to destination. Both names must be files.
shutil.copytree(source, destination)
Recursively copy the entire directory tree rooted at source to destination. destination must not already exist. Errors are reported to standard output.
shutil.rmtree(path)
Recursively delete a directory tree rooted at path.
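The four functions can be tried safely inside a temporary directory. This is a sketch for Python 3; all of the file and directory names are invented for the illustration.

```python
import os
import os.path
import shutil
import tempfile

parent = tempfile.mkdtemp()                    # scratch area for the demonstration
original = os.path.join(parent, "original")
os.mkdir(original)

source = os.path.join(original, "data.txt")
with open(source, "w") as target:
    target.write("some data\n")

# copy: the destination may be a directory; the base name is kept.
backup_dir = os.path.join(original, "backup")
os.mkdir(backup_dir)
shutil.copy(source, backup_dir)

# copyfile: both names must be file names.
shutil.copyfile(source, os.path.join(original, "data.bak"))

# copytree: the destination directory must not exist yet.
clone = os.path.join(parent, "clone")
shutil.copytree(original, clone)
copied = sorted(os.listdir(clone))
in_backup = os.listdir(os.path.join(clone, "backup"))

# rmtree: remove the scratch area and everything below it.
shutil.rmtree(parent)
print(copied, in_backup)
```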
The GNU/Linux shell expands wild-cards to complete lists of file names; the verb is to glob (really). The glob module makes the name globbing capability available to Windows programmers. The glob module includes the following function that locates all names which match a given pattern.
glob.glob(pattern)
Return a list of filenames that match the given wild-card pattern. The fnmatch module is used for the wild-card pattern matching.
A common use for glob is something like the following.
import glob, sys
for wildcard in sys.argv[1:]:
    for f in glob.glob(wildcard):
        process( f )
This can make Windows programs process command line arguments somewhat like Unix programs. Each argument is passed to glob.glob() to expand any patterns into a list of matching files. If the argument is not a wild-card pattern, glob simply returns a list containing this one file name, provided that the file exists; a name that matches nothing produces an empty list.
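To see what glob.glob() returns without depending on the files in your own directory, the following Python 3 sketch builds a small temporary directory first. The file names are hypothetical.

```python
import glob
import os
import os.path
import tempfile

with tempfile.TemporaryDirectory() as folder:
    for name in ("a.py", "b.py", "notes.txt"):      # invented file names
        open(os.path.join(folder, name), "w").close()

    # glob.escape protects any special characters in the directory name itself.
    base = glob.escape(folder)

    matches = sorted(os.path.basename(f)
                     for f in glob.glob(os.path.join(base, "*.py")))
    plain = glob.glob(os.path.join(base, "notes.txt"))     # no wild-card, file exists
    missing = glob.glob(os.path.join(base, "absent.txt"))  # no wild-card, no such file

print(matches)
print(len(plain), len(missing))
```

A plain name that exists comes back as a one-item list, while a name that matches nothing gives an empty list.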
The fnmatch module has the essential algorithm for matching a wild-card pattern against file names. This module implements the Unix shell wild-card rules. These rules are used by glob to locate all files that match a given pattern. The module contains the following function:
fnmatch.fnmatch(filename, pattern)
Return True if the filename string matches the pattern string.
The patterns use * to match any number of characters, ? to match any single character. [letters] matches any of these letters, and [!letters] matches any letter that is not in the given set of letters.
>>> import fnmatch
>>> fnmatch.fnmatch('greppy.py','*.py')
True
>>> fnmatch.fnmatch('README','*.py')
False
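The ?, [letters] and [!letters] forms can be explored the same way. This Python 3 sketch uses a made-up list of file names.

```python
import fnmatch

# Hypothetical file names for the demonstration.
names = ["data1.csv", "data2.csv", "datax.csv", "data10.csv", "notes.txt"]

# ? matches exactly one character.
one_char = [n for n in names if fnmatch.fnmatch(n, "data?.csv")]

# [0-9] matches one character from the set of digits.
digits = [n for n in names if fnmatch.fnmatch(n, "data[0-9].csv")]

# [!0-9] matches one character that is not a digit.
non_digits = [n for n in names if fnmatch.fnmatch(n, "data[!0-9].csv")]

print(one_char)
print(digits)
print(non_digits)
```

Notice that data10.csv matches none of these patterns, because each pattern allows exactly one character between data and .csv.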
This chapter describes a number of modules that are specifically designed to be the same in Linux, Mac OS and Windows. By using these modules, you can be assured that your Python program will work the same everywhere.
The os module contains an interface to many operating system-specific functions that manipulate processes, files, file descriptors, directories and other operating system resources. This module is specific to the operating system. Programs that import and use os stand a better chance of being portable between different platforms. Portable programs must depend only on functions that are supported for all platforms (e.g., unlink() and opendir()), and leave all pathname manipulation to os.path.
The os module exports a number of constants. These constants are like variables, but changing their value will not have any beneficial effects on your program. The following definitions in this module provide useful information about the operating system.
os.name
One of posix, nt, dos, os2, mac, or ce.
os.curdir
String representing the current directory (., generally)
os.pardir
String representing the parent directory (.., generally)
os.sep
The (or the most common) pathname separator character ( / generally, \ on Windows). Most of the Python library routines will translate the standard / for use on Windows.
It is better to use the os.path module to construct or parse path names.
os.altsep
The alternate pathname separator (None generally, or / on Windows).
os.pathsep
The component separator used in $PATH (: generally, ; on Windows).
os.linesep
The line separator in text files (the standard newline character, \n, or the Windows variant, \r\n). This is already part of the readlines() function and the file iterator.
os.defpath
The default search path that the operating system uses to find an executable file.
os.chdir(path)
Change the current working directory to path.
import os
os.chdir( "/Volumes/Slott02/Writing/Tech/PFNP/Notes" )
os.getcwd()
Return the current working directory path.
import os
print(os.getcwd())
os.remove(path)
os.unlink(path)
Delete (“remove”, “unlink” or “erase”) the file.
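The following Python 3 sketch prints the constants for whatever system it runs on, then changes into a temporary directory, creates a scratch file and deletes it again. The file name scratch.txt is invented for the example.

```python
import os
import tempfile

# The values printed here depend on the operating system running the program.
print(os.name, repr(os.sep), repr(os.altsep), repr(os.pathsep), repr(os.linesep))
print(os.curdir, os.pardir)

start = os.getcwd()                       # remember where we began
with tempfile.TemporaryDirectory() as folder:
    os.chdir(folder)
    try:
        with open("scratch.txt", "w") as target:   # created in the new working directory
            target.write("temporary\n")
        created = os.path.exists("scratch.txt")
        os.remove("scratch.txt")
        removed = not os.path.exists("scratch.txt")
    finally:
        os.chdir(start)                   # go back before the directory is cleaned up

print(created, removed, os.getcwd() == start)
```

Returning to the starting directory in the finally clause matters: some operating systems refuse to delete a directory that is still the current working directory.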
Many Python programs will also deal with Python objects that are exported from memory to external files or retrieved from files to memory. Since an external file is more persistent than the volatile working memory of a computer, this process makes an object persistent or retrieves a persistent object. One mechanism for creating a persistent object is called serialization, and is supported by several modules, which are beyond the scope of this book.
- pickle
- Convert between streams of bytes (on a file) and Python objects. This is very nice for saving a Python object to a disk file.
- cPickle
- Faster version of pickle, but you cannot subclass any of the classes, since it’s written in C, not Python.
- copy_reg
- Register pickle support functions.
- shelve
- Python object persistence.
- marshal
- Convert Python objects to streams of bytes and back (with different constraints).
- sqlite3
- This is a SQL-compatible relational database. It does a great deal and is very sophisticated.
Additionally, modules to access the widely-used DBM database manager are available.
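Although the details are outside our scope, a very small Python 3 sketch shows the flavor of serialization with pickle. The dictionary contents are made up, and an in-memory io.BytesIO buffer stands in for a disk file opened in binary mode.

```python
import io
import pickle

# A made-up object to preserve.
quilt = {"name": "Log Cabin", "year": 1998, "blocks": [12, 12]}

buffer = io.BytesIO()        # stands in for a file opened with mode "wb"
pickle.dump(quilt, buffer)   # Python object -> stream of bytes

buffer.seek(0)               # rewind, as if reopening the file for reading
restored = pickle.load(buffer)   # stream of bytes -> Python object

print(restored)
```

The restored object is an equal copy of the original, not the same object in memory.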
These modules can help work with popular file compression and archiving formats. These formats include .zip files as well as .tar files and .gzip and .gz files.
- zlib
- A module for reading and writing data that has been compressed with the ZIP standard compression algorithms.
- gzip
- A module for reading and writing data that has been compressed with the GNU ZIP compression algorithms.
- bz2
- A module for reading and writing data that has been compressed with the GNU BZ2 compression algorithms.
- zipfile
- A module for reading or creating a zip-format archive file.
- tarfile
- A module for reading or creating a TAR-format archive file.
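As a signpost only, here is a Python 3 sketch of compressing data with gzip and of building a tiny ZIP archive with zipfile. The text being compressed and the member name notes/lines.txt are invented, and an in-memory buffer stands in for an archive file on disk.

```python
import gzip
import io
import zipfile

text = "line one\nline two\n" * 100      # repetitive sample text compresses well
raw = text.encode("utf-8")

# gzip: compress and decompress a block of bytes.
packed = gzip.compress(raw)
unpacked = gzip.decompress(packed)
print(len(raw), "bytes before,", len(packed), "bytes after")

# zipfile: an archive can hold many named members.
archive = io.BytesIO()                   # stands in for a .zip file on disk
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes/lines.txt", text)

with zipfile.ZipFile(archive) as zf:
    members = zf.namelist()
    recovered = zf.read("notes/lines.txt").decode("utf-8")

print(members)
```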
Reading and processing files of Internet data types is very common. Internet data types have formal definitions governed by the internet standards, called Requests for Comments (RFCs). The following modules are for handling Internet data structures. These modules and the related standards are beyond the scope of this book. We provide them as signposts so that you can research available modules and not reinvent each of these various wheels.
- email
- An email and MIME handling package.
- base64
- Encode and decode files using the MIME base64 data.
- binhex
- Encode and decode files in binhex4 format.
- binascii
- Tools for converting between binary and various ASCII-encoded binary representations.
- quopri
- Encode and decode files using the MIME quoted-printable encoding.
- uu
- Encode and decode files in uuencode format.
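For a taste of what these modules do, this Python 3 sketch uses base64 to turn arbitrary bytes into plain ASCII characters and back again. The payload is just a sample.

```python
import base64

payload = b"hello"                       # any bytes at all could go here

encoded = base64.b64encode(payload)      # bytes -> ASCII-only bytes
decoded = base64.b64decode(encoded)      # and back again

print(encoded)   # b'aGVsbG8='
print(decoded)
```

The encoded form is safe to send through channels, such as email, that were designed for text.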
There are a number of modules described in the Runtime Services section of the Library Reference. We want to emphasize just one, sys.
The sys module provides access to some objects used or maintained by the interpreter and to functions that interact with the interpreter.
Most importantly, the sys module provides access to the three standard OS files used by Python.
sys.stdin
Standard input file object; used by the raw_input() function. Also available via sys.stdin.read() and related methods of the file object.
sys.stdout
Standard output file object; used by the print() function. Also available via sys.stdout.write() and related methods of the file object.
sys.stderr
Standard error object; used for error messages, typically unhandled exceptions. Available via sys.stderr.write() and related methods of the file object.
This can be used as follows:
from __future__ import print_function
import sys
print("some error message", file=sys.stderr)
There are as many software architectures as there are architects. All of these architectures are collections of software components that are connected by files. As newbies, we can look at four common architectural variations. You’re reading this book because you have a particular problem you want to solve. At this point, it may not be obvious how to get from the broadly-defined concept of file in the previous section down to the brass tacks of a working application program. This section will look at the ways in which applications are assembled from the available components and held together with files.
We’ll look at the following architectural patterns to show you what kinds of file processing you’ll need to do.
Command-Line Interface Applications. The GNU/Linux world has hundreds, perhaps thousands of CLI applications. Windows also has a large number. Everything from the common ls command to more complex commands like java and python all work by reading and writing files. All of these command-line applications have some common features. These features are so important, that we’ll devote all of Fit and Finish: Complete Programs to this subject.
There are a few central fittings to making a useful command-line application. An excellent example is the GNU/Linux grep program (or the Windows find program).
File operations you will use.
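To make the grep example concrete, here is a much-simplified Python 3 sketch of the same idea: read lines from one file object and write the matching lines to another. The function name and sample text are our own, and unlike the real grep this version looks for a plain substring rather than a regular expression.

```python
import io
import sys

def grep(pattern, source, target=sys.stdout):
    """Write each line of source that contains pattern to target; return the count."""
    count = 0
    for line in source:
        if pattern in line:
            target.write(line)
            count += 1
    return count

# io.StringIO objects stand in for the input and output files.
sample = io.StringIO("alpha\nbeta\nalphabet\n")
output = io.StringIO()
matched = grep("alpha", sample, output)

print(matched, "matching lines")
print(output.getvalue(), end="")
```

Because the function only needs file objects, the same code works with sys.stdin, with a disk file opened by open(), or with the in-memory stand-ins used here.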
Graphic User Interface Applications. GUI applications include IDLE, your favorite word processor, spreadsheet and web browser. Most of what we use computers for involves GUI applications. In a few cases, the GUI application is a wrapper or veneer that surrounds an underlying command-line application.
There are a few central fittings to making a useful GUI application. An excellent example is IDLE.
You will often create a file object, given the name of a disk file. After all, that’s usually the point of using an application. The Python programming will read or write characters using that file object.
Since you’re using the graphics library to interact with the mouse, keyboard and display devices, you won’t use files for these user interaction devices directly.
Web Applications. You use a web application when you run a web browser like Firefox, Safari, Chrome, Opera or Internet Explorer. Your browser is a GUI application: it reads from the mouse and keyboard and displays back to the user. Browsers use sophisticated graphics libraries, some of which are highly tailored toward doing browsing.
More important, however, is the role the browser plays in the overall application. A browser application connects you with a web server. When you request a web page (by typing the URL or clicking on a link), your browser makes a request from a web server. When you fill in a form and click a submit (or search or buy now) button, you are making a request of a web server.
Writing a web application means putting the right programming on a web server. Web programming happens in a variety of forms, and uses a number of different languages. The reason for the complexity of web applications is to spread out the workload and allow a large number of people to make requests and efficiently share the web server.
The core of web applications is the HTML language. When you make a web request, the reply is almost always a page of HTML. Your web browser opens a kind of file called a socket. The browser writes the request, and then reads the reply. The reply will be HTML which is rendered and presented as a “page” of content.
Serving Web Content. On the other side of the web transaction, the web server is waiting for requests from browsers. The server reads the request, locates the content, and sends the HTML page to the browser. The browser will also request the various pieces of “media” (graphics, sounds, etc.), which are sent separately.
Some HTML pages are static, which means that the web server takes an HTML file from the disk and sends it through the internet to your browser. This job is very simple and easily standardized. A program named Apache httpd handles this job very nicely.
Some HTML pages are dynamic, which means that some program created customized HTML, and sent this through the internet to your browser. Often, this program will be a partner with Apache httpd. Generally, you’ll simplify your life by using a web framework for this kind of programming.
File operations you might use in a web program.
You don’t have access to the user’s computer or anything on the user’s computer; only the browser can do that. All of your file operations are confined to your web server. You can, through HTML, make it easy for someone to download files to their desktop computer, but you have no direct access.
The general approach is to use any of the Python web-frameworks. You can research Django, TurboGears, Quixote and Zope to see a spectrum of just a few alternatives. There are dozens of frameworks to help you manage these popular kinds of applications.
Embedded Control Applications. Let’s imagine that we are inventing a new kind of heat pump controlled by a computer. We’ve bought our heating and refrigeration coils, we’ve got a reversing valve and a variable-speed motor. We’ve rigged up a working set of hardware in our garage, but we need a computer and software to control all of this hardware.
We’ll need to create interfaces that transform information from the outside world like temperature, pressure, valve position, motor speed into electronic signals the computer can read. We’ll also need to transform electronic signals into actions like starting a motor or changing a valve position. We need to purchase and configure the necessary computer parts. We also need to write device drivers.
Our device drivers are the glue that connects our file system to our temperature probes, coolant pressure sensors, valve position sensor and motor speed indicator. Each of these devices can appear as a file. When we read from the temperature file, for example, our driver uses this request to gather information from the thermistor, encode that as a number, and provide this number to our program.
While there’s a large amount of computer engineering involved, you will still use some standard file operations. You will create a file object, given the name of a device which appears as a file. You will read or write data using that file object.
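From the Python side, reading such a device looks just like reading any other file. The following Python 3 sketch is purely illustrative: the function name, the idea that the device yields one number per line, and the temporary file that stands in for the device are all assumptions, since we have no real heat pump driver to talk to.

```python
import os.path
import tempfile

def read_temperature(device_name):
    """Open the (hypothetical) device file and return one reading as a number."""
    with open(device_name, "r") as device:
        return float(device.readline().strip())

# An ordinary temporary file plays the part of the temperature device.
with tempfile.TemporaryDirectory() as folder:
    fake_device = os.path.join(folder, "temperature")
    with open(fake_device, "w") as stand_in:
        stand_in.write("21.5\n")
    reading = read_temperature(fake_device)

print(reading)
```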