The file Class; The with Statement
Programs often deal with external data: data outside of volatile primary memory. This external data could be persistent data on a file system or transient data on an input-output device. Most operating systems provide a simple, uniform interface to external data via objects of the file class.
In File Semantics, we provide an overview of the semantics of files. We cover the most important of Python’s built-in functions for working with files in Built-in Functions. We’ll review statements for dealing with files in File Statements. In File Methods, we describe some method functions of file objects.
Files are a deep, deep subject. We’ll touch on several modules that are related to managing files in Components, Modules and Packages. These include File Handling Modules and File Formats: CSV, Tab, XML, Logs and Others.
In one sense a file is a container for a sequence of bytes. A more useful view, however, is that a file is a container of data objects, encoded as a sequence of bytes. Files can be kept on persistent but slow devices like disks. Files can also be presented as a stream of bytes flowing through a network interface. Even the user’s keyboard can be processed as if it was a file; in this case the file forces our software to wait until the person types something.
Our operating systems use the abstraction of file as a way to unify access to a large number of devices and operating system services. In the Linux world, all external devices, plus a large number of in-memory data structures are accessible through the file interface. The wide variety of things with file-like interfaces is a consequence of how Unix was originally designed. Since the number and types of devices that will be connected to a computer is essentially infinite, device drivers were designed as a simple, flexible plug-in to the operating system. For more information on the ubiquity of files, see Additional Background.
Files include more than disk drives and network interfaces. Kernel memory, random data generators, semaphores, shared memory blocks, and other things have file interfaces, even though they aren’t – strictly speaking – devices. Our OS applies the file abstraction to many things. Python, similarly, extends the file interface to include certain kinds of in-memory buffers.
All GNU/Linux operating systems make all devices available through a standard file-oriented interface. Windows makes most devices available through a reasonably consistent file interface. Python’s file class provides access to the OS file API’s, giving our applications the same uniform access to a variety of devices.
The terminology is sometimes confusing. We have physical files on our disk, the file abstraction in our operating system, and file objects in our Python program. Our Python file object makes use of the operating system file API’s which, in turn, manipulate the files on a disk.
We’ll try to be clear, but with only one overloaded word for three different things, this chapter may sometimes be confusing.
We rarely have a reason to talk about a physical file on a disk. Generally we’ll talk about the OS abstraction of file and the Python class of file.
Standard Files. Consistent with POSIX standards, all Python programs have three files available: sys.stdin, sys.stdout, sys.stderr. These files are used by certain built-in statements and functions. The print statement (and print() function), for example, writes to sys.stdout by default. The raw_input() function writes the prompt to sys.stdout and reads the input from sys.stdin.
These standard files are always available, and Python assures that they are handled consistently by all operating systems. The sys module makes these files available for explicit use. Newbies may want to check File Redirection for Newbies for some additional notes on these standard files.
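A minimal sketch of using the standard files explicitly via the sys module follows; the message strings are just for illustration.

```python
import sys

# The three standard files are ordinary file-like objects.
streams = [sys.stdin, sys.stdout, sys.stderr]

# Ordinary results go to standard output; diagnostics go to
# standard error, so they survive output redirection.
sys.stdout.write("processing started\n")
sys.stderr.write("warning: no configuration file found\n")
```

Because sys.stderr is separate from sys.stdout, a shell redirection of the program's output leaves the warnings visible on the console.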
Some operating systems provide support for a large variety of file organizations. Different file organizations will include different record termination rules, possibly with record keys, and possibly fixed length records. The POSIX standard, however, considers a file to be nothing more than a sequence of bytes. It becomes entirely the job of the application program, or libraries outside the operating system to impose any organization on those bytes.
The basic file objects in Python consider a file to be a sequence of text characters (ASCII or Unicode) or bytes. The characters can be processed as a sequence of variable length lines; each line terminated with a newline character. Files moved from a Windows environment may contain lines which appear to have an extraneous ASCII carriage return character (\r), which is easily removed with the string strip() method.
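The carriage-return cleanup looks like this; the sample line stands in for a line read from a Windows-created file.

```python
# A line read from a file created on Windows may end with "\r\n".
line = "first,second,third\r\n"

# strip() removes the trailing carriage return and newline,
# along with any other leading or trailing whitespace.
clean = line.strip()
print(clean)
```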
Ordinary text files can be managed directly with the built-in file objects and their methods for reading and writing lines of data. We will cover this basic text file processing in the rest of this chapter.
Files which are a sequence of bytes don’t – properly – have line boundaries. Byte-oriented files could include characters (in ASCII or a Unicode encoding) or other data objects encoded as bytes. We’ll address some byte-oriented files with library modules like pickle and csv.
The GNU/Linux view of files can be surprising for programmers with a background that focuses on mainframe Z/OS or Windows. This is additional background information for programmers who are new to the POSIX use of the file abstraction. This POSIX view informs how Python works.
In the Z/OS world, files are called data sets, and can be managed by the OS catalog or left uncataloged. Uncataloged data sets are fairly common.
In the GNU/Linux world, the catalog (called a directory) is seamless, silent and automatic; files are almost always cataloged. Uncataloged, temporary files are atypical, rarely used, and require a special API call.
In the Z/OS world, files are generally limited to disk files and nothing else. This is different from the GNU/Linux use of file to mean almost any kind of external device or service.
Block Mode Files. File devices can be organized into two different kinds of structures: block mode and character mode. Block mode devices are exemplified by magnetic disks: the data is structured into blocks of bytes that can be accessed in any order. Both the media (disk) and read-write head can move; the device can be repositioned to any block as often as necessary. A disk provides direct (sometimes also called random) access to each block of data.
Character mode devices are exemplified by network connections: the bytes come pouring into the processor buffers. The stream cannot be repositioned. If the buffer fills up and bytes are missed, the lost data are gone forever.
Operating system support for block mode devices includes file directories and file management utilities for deleting, renaming and copying files. Modern operating systems include file navigators (Finders or Explorers), iconic representations of files, and standard GUI dialogs for opening files from within application programs. The operating system also handles moving data blocks from memory buffers to disk and from disk to memory buffers. All of the device-specific vagaries are handled by having a variety of device drivers so that a range of physical devices can be supported in a uniform manner by a single operating system software interface.
Files on block mode devices are sometimes called seekable. They support the operating system seek() function that can begin reading from any byte of the file. If the file is structured in fixed-size blocks or records, this seek function can be very simple and effective. Typically, database applications are designed to work with fixed-size blocks so that seeking is always done to a block, from which database rows are manipulated.
Character Mode Devices and Keyboards. Operating systems also provide rich support for character mode devices like networks and keyboards. Typically, a network connection requires a protocol stack that interprets the bytes into packets, and handles the error correction, sequencing and retransmission of the packets. One of the most famous protocol stacks is the TCP/IP stack. TCP/IP can make a streaming device appear like a sequential file of bytes. Most operating systems come with numerous client programs that make heavy use of the network; examples include sendmail, ftp, and a web browser.
A special kind of character mode file is the console; it usually provides input from the keyboard. The POSIX standard allows a program to be run so that input comes from files, pipes or the actual user. If the input file is a TTY (teletype), this is the actual human user’s keyboard. If the file is a pipe, this is a connection to another process running concurrently. The keyboard console or TTY is different from ordinary character mode devices, pipes or files for two reasons. First, the keyboard often needs to explicitly echo characters back so that a person can see what they are typing. Second, pre-processing must often be done to make backspaces work as expected by people.
The echo feature is enabled for entering ordinary data or disabled for entering passwords. The echo feature is accomplished by having keyboard events be queued up for the program to read as if from a file. These same keyboard events are automatically sent to update the GUI if echo is turned on.
The pre-processing feature is used to allow some standard edits of the input before the application program receives the buffer of input. A common example is handling the backspace character. Most experienced computer users expect that the backspace key will remove the last character typed. This is handled by the OS: it buffers ordinary characters, removes characters from the buffer when backspace is received, and provides the final buffer of characters to the application when the user hits the Return key. This handling of backspaces can also be disabled; the application would then see the keyboard events as raw characters. The usual mode is for the OS to provide cooked characters, with backspace characters handled before the application sees any data.
Typically, this is all handled in a GUI in modern applications. However, Python provides some functions to interact with Unix TTY console software to enable and disable echo and process raw keyboard input.
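As a concrete example, the standard getpass module reads a line with echo disabled, and the lower-level termios and tty modules control cooked versus raw mode on Unix-like systems. A minimal sketch (the function name read_secret is our own invention):

```python
import getpass

def read_secret(prompt="Password: "):
    # getpass.getpass() turns echo off while the user types,
    # then restores the terminal to its normal cooked mode.
    return getpass.getpass(prompt)
```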
File Formats and Access Methods. In Z/OS (and Open VMS, and a few other operating systems) files have very specific formats, and data access is mediated by the operating system. In Z/OS, they call these access methods, and they have names like BDAM or VSAM. This view is handy in some respects, but it tends to limit you to the access methods supplied by the OS vendor.
The GNU/Linux view is that files should be managed minimally by the operating system. At the OS level, files are just bytes. If you would like to impose some organization on the bytes of the file, your application should provide the access method. You can, for example, use a database management system (DBMS) to structure your bytes into tables, rows and columns.
The C-language standard I/O library (stdio) can access files as a sequence of individual lines; each line is terminated by the newline character, \n. Since Python is built on the C libraries, Python can also read files as a sequence of lines.
There are two built-in functions that create a new file or open an existing file.
The file() function does the same thing as open(). It is present so that the name of this factory function matches the class of the object being created.
The open() function is more descriptive of what is really going on in the program.
The file() function is used for type comparisons.
Creating File Name Strings. A filename string can be given as a standard name, or it can use OS-specific punctuation. The standard is to use / to separate elements of a file path; Python can do OS-specific translation.
Windows, however, uses \ for most levels of the path, but has a leading device character separated by a : .
Rather than force your program to implement the various operating system punctuation rules, Python provides modules to help you construct and process file names. The os.path module should be used to construct file names. Best practice is to use the os.path.join() function to make file names from sequences of strings. We’ll look at this in File Handling Modules.
The filename string can be a simple file name, also called a relative path string, where the OS rules of applying a current working directory are used to create a full, absolute path. Or the filename string can be a full absolute path to the file.
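A short sketch of building portable file names with os.path; the path elements here are only examples.

```python
import os.path

# Build a portable file name instead of concatenating
# OS-specific punctuation by hand.
logname = os.path.join("var", "log", "app.log")

# Expand a relative name into a full, absolute path using
# the current working directory.
fullname = os.path.abspath("app.log")
print(fullname)
```

On Windows, os.path.join() inserts backslashes; on POSIX systems, slashes. The program text stays the same either way.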
File Mode Strings. The mode string specifies how the file will be accessed by the program. There are three separate issues addressed by the mode string: opening, text handling and operations.
Opening. For the opening part of the mode string, there are three alternatives:
Open for reading. Start at the beginning of the file. If the file does not exist, raise an IOError exception. This is implied if nothing else is specified.
Open for writing. Start at the beginning of the file, overwriting an existing file. If the file does not exist, create the file.
Open for appending. Start at the end of the file. If the file does not exist, create the file.
Text Handling. For the text handling part of the mode string, there are two alternatives:
Interpret the file as bytes, not text.
The default, if nothing is specified, is to interpret the content as text: a sequence of characters with newlines at the end of each line.
The capital U mode (when used with r) enables “universal newline” reading. This allows your program to cope with the non-standard line-ending characters present in some Windows files. The standard end-of-line is a single newline character, \n. In Windows, an additional \r character may also be present.
Operations. For the additional operations part of the mode string, there are two alternatives:
Allow both read and write operations.
If nothing is specified, allow only reads for files opened with “r”; allow only writes for files opened with “w” or “a”.
Typical combinations include "rb" to read data as bytes and "w+" to create a text file for reading and writing.
Examples. The following examples create file objects for further processing:
myLogin = open( ".login", "r" )
newSource = open( "somefile.c", "w" )
theErrors = open( "error.log", "a" )
someData = open( 'source.dat', 'rb' )
myLogin: A text file, opened for reading.

newSource: A text file, opened for writing. If the file exists, it is overwritten.

theErrors: A text file, opened for appending. If the file doesn’t exist, it’s created.

someData: A binary file, opened for reading.
Buffering files is typically left as a default, specifying nothing. However, for some situations buffering can improve performance. Error logs, for instance, are often unbuffered, so the data is available immediately. Large input files may have large buffer numbers specified to encourage the operating system to optimize input operations by reading a few large chunks of data instead of a large number of smaller chunks.
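A sketch of the buffering argument to open(); the file name is only for illustration. (Note that fully unbuffered mode, buffering of 0, is restricted to binary files in recent Python versions.)

```python
# The third argument to open() controls buffering: 1 means
# line-buffered, larger numbers suggest a buffer of roughly
# that many bytes.
log = open("error.log", "a", 1)   # line-buffered error log
log.write("something went wrong\n")
log.close()

big = open("error.log", "r", 2 ** 16)  # read with a 64 KB buffer
data = big.read()
big.close()
```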
There are a number of statements that have specific features related to file objects.
The for Statement. Principally, we use the for statement to work with files. Text files are iterable, making them a natural fit with the for statement.
The most common pattern is the following:
source = open( "someFile.txt", "r" )
for line in source:
    # process line
    pass
source.close()
Additionally, we use the with statement with files. This assures that we have – without exception – closed the file when we’re done using it.
The with Statement. The with statement is used to be absolutely sure that we have closed a file (or other resource) when we’re done using it.
The with statement uses an object called a “context manager”. This manager object can be assigned to a temporary variable and used in the with statement’s suite. See Managing Contexts: the with Statement for more information on creating a context manager.
The two central features of a context manager are its __enter__() and __exit__() methods.
with Statement Syntax. The with statement has the following syntax.
with expression as variable :
    suite
A file object conforms to the context manager interface. It has an __enter__() and a __exit__() method. It will be closed at the end of the with statement.
Generally, we use this as follows.
with file("somefile","r") as source:
    for line in source:
        print line
At the end of the with statement, irrespective of any exceptions which are handled – or not handled – the file will be closed and the relevant resources released.
Read Methods. The following methods read from a file. As data is read, the file position is advanced from the beginning to the end of the file. The file must be opened with a mode that includes or implies 'r' for these methods to work.
Read the next line. If size is negative or omitted, the next complete line is read. If the size is given and positive, read as many as size characters from the file; an incomplete line can be read.
If a complete line is read, it includes the trailing newline character, \n. If the file is at the end, this will return a zero length string.
If the file has a blank line, the blank line will be a string of length 1 (just the newline character at the end of the line).
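These semantics can be demonstrated with a small scratch file (the file name demo.txt is arbitrary); the sketch uses the modern with statement and print() function form.

```python
# Build a three-line file (the middle line is blank),
# then read it back with readline().
with open("demo.txt", "w") as target:
    target.write("first\n\nlast\n")

source = open("demo.txt", "r")
lines = []
while True:
    line = source.readline()
    if line == "":        # a zero-length string signals end of file
        break
    lines.append(line)    # the blank line arrives as "\n"
source.close()
print(lines)              # → ['first\n', '\n', 'last\n']
```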
Write Methods. The following methods write to a file. As data is written, the file position is advanced, possibly growing the file. If the file is opened for write, the position begins at the beginning of the file. If the file is opened for append, the position begins at the end of the file. If the file does not already exist, both writing and appending are equivalent. The file must be opened with a mode that includes 'a' or 'w' for these methods to work.
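The two basic writers, write() and writelines(), can be sketched as follows; the file name report.txt is arbitrary.

```python
# write() puts one string in the file; writelines() writes each
# string from a sequence.  Neither adds newline characters for you.
report = open("report.txt", "w")
report.write("heading\n")
report.writelines(["detail 1\n", "detail 2\n"])
report.close()

content = open("report.txt", "r").read()
```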
Position Control Methods. The current position of a file can be examined and changed. Ordinary reads and writes will alter the position. These methods will report the position, and allow you to change the position that will be used for the next operation.
Change the position from which the file will be processed. There are three values for whence which determine the direction of the move. If whence is zero (or omitted), move to the absolute position given by offset. f.seek(0) will rewind file f.
If whence is 1, move relative to the current position by offset bytes. If offset is negative, move backwards; otherwise move forward.
If whence is 2, move relative to the end of file. f.seek(0,2) will advance file f to the end, making it possible to append to the file.
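A sketch of seek() and tell() on a ten-byte scratch file; binary mode keeps the byte positions exact.

```python
# Write ten bytes, then reposition with seek() and report
# positions with tell().
with open("seekdemo.dat", "w") as target:
    target.write("abcdefghij")

demo = open("seekdemo.dat", "rb")
demo.seek(4)          # whence 0 (absolute position) is the default
part = demo.read(3)   # reads the bytes at positions 4, 5, 6
position = demo.tell()
demo.seek(0, 2)       # whence 2: relative to the end of file
size = demo.tell()
demo.close()
```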
Other Methods. These are additional useful methods of a file object.
Some handy attributes of a file.
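Three commonly used attributes are name, mode and closed; a quick sketch (the file name is arbitrary):

```python
target = open("attrs.txt", "w")
print(target.name)     # the file name given to open()
print(target.mode)     # the mode string
print(target.closed)   # False while the file is open
target.close()
print(target.closed)   # True after close()
```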
We’ll look at four examples of file processing. In all cases, we’ll read simple text files. We’ll show some traditional kinds of file processing programs and how those can be implemented using Python.
The following program will examine a standard unix password file. We’ll process the file line by line to show the processing in detail. We’ll use the split() method of the input string as an example of parsing a line of input.
pswd = file( "/etc/passwd", "r" )
for aLine in pswd:
    fields= aLine.split( ":" )
    print fields[0], fields[4]
pswd.close()
For non-unix users, a password file looks like the following:
root:q.mJzTnu8icF.:0:10:Sysadmin:/:/bin/csh
fred:6k/7KCFRPNVXg:508:10:% Fredericks:/usr2/fred:/bin/csh
The : separated fields include the user name, password, user id, group id, the user’s full name, home directory and shell to run upon login.
This file will have a CSV (Comma-Separated Values) file format that we will parse. The csv module does a far better job than this little program. We’ll look at that module in Comma-Separated Values: The csv Module.
A popular stock quoting service on the Internet will provide CSV files with current stock quotes. The files have comma-separated values in the following format:
stock, lastPrice, date, time, change, openPrice, daysHi, daysLo, volume
The stock, date and time are typically quoted strings. The other fields are numbers, typically in dollars or percents with two digits of precision. We can use the Python eval() function on each column to gracefully evaluate each value, which will eliminate the quotes, and transform a string of digits into a floating-point price value. We’ll look at dates in Dates and Times: the time and datetime Modules.
This is an example of the file:
"^DJI",10623.64,"6/15/2001","4:09PM",-66.49,10680.81,10716.30,10566.55,N/A
"AAPL",20.44,"6/15/2001","4:01PM",+0.56,20.10,20.75,19.35,8122800
"CAPBX",10.81,"6/15/2001","5:57PM",+0.01,N/A,N/A,N/A,N/A
The first line shows a quote for an index: the Dow-Jones Industrial average. The trading volume doesn’t apply to an index, so it is “N/A”. The second line shows a regular stock (Apple Computer) that traded 8,122,800 shares on June 15, 2001. The third line shows a mutual fund. The detailed opening price, day’s high, day’s low and volume are not reported for mutual funds.
After looking at the results online, we clicked on the link to save the results as a CSV file. We called it quotes.csv. The following program will open and read the quotes.csv file after we download it from this service.
qFile= file( "quotes.csv", "r" )
for q in qFile:
    try:
        stock, price, date, time, change, opPrc, dHi, dLo, vol \
            = q.strip().split( "," )
        print eval(stock), float(price), date, time, change, vol
    except ValueError:
        pass
qFile.close()
We open our quotes file, quotes.csv, for reading, creating an object named qFile.
We use a for statement to iterate through the sequence of lines in the file.
The quotes file typically has an empty line at the end, which doesn’t split into the expected nine fields, so we surround the multiple assignment with a try statement. The empty line will raise a ValueError exception, which is caught in the except clause and ignored.
Each stock quote, q, is a string. We use the string.strip() method to remove whitespace; on the resulting string we use the string.split() method to split the string on ",". This transforms the input string into a list of individual fields.
We use multiple assignment to assign each field to a relevant variable. Note that we split this line into nine fields, leading to a long statement. We put a \ to break the statement into two lines.
The name of the stock is a string which includes extra quotes. In order to gracefully remove the quotes, we use the eval() function.
The price is a string. We could also use eval function to evaluate this string as a Python value. Instead, we use the float() function to convert the price string to a proper numeric value for further processing.
As a practical matter, this is a currency value, and we need to use a Decimal value, not a float value. The decimal module handles currency very nicely.
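A quick sketch of the Decimal approach; the price and share count are made-up values in the style of the quotes file.

```python
from decimal import Decimal

# Convert the price string from the file directly to Decimal;
# going through float would introduce binary rounding error.
price = Decimal("20.44")
shares = Decimal("100")
cost = price * shares
print(cost)   # 2044.00, exact to the penny
```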
For COBOL expatriates, here’s an example that shows a short way to read a file into an in-memory sequence, sort that sequence and print the results. This is a very common COBOL design pattern, and it tends to be rather long and complex in COBOL.
data= []
with file( "quotes.csv", "r" ) as qFile:
    for q in qFile:
        fields= tuple( q.strip().split( "," ) )
        if len(fields) == 9:
            data.append( fields )

def priceVolume(a):
    return a[1], a[8]

data.sort( key=priceVolume )
for stock, price, date, time, change, opPrc, dHi, dLo, vol in data:
    print stock, price, date, time, change, vol
We create an empty sequence, data, to which we will append tuples created from splitting each line into fields.
We create file object, qFile that will read all the lines of our CSV-format file.
This for loop will set q to each line in the file.
The variable fields is created by stripping whitespace from the line, q, breaking it up on the "," boundaries into separate fields, and making the resulting sequence of field values into a tuple.
If the line has the expected nine fields, the tuple is appended to the data sequence. Lines with the wrong number of fields are typically the blank lines at the beginning or end of the file.
To prepare for the sort, we define a key function. This will extract fields 1 and 8, price and volume.
If we don’t use a key function, the tuple will be sorted by fields in order. The first field is stock name.
We can then sort the data sequence. Note that the list.sort() method does not return a value. It mutates the list.
The sort method will use our priceVolume() function to extract keys for comparing records. This kind of sort is covered in depth in Advanced List Sorting.
Once the sequence of data elements is sorted, we can then print a report showing our stocks ranked by price, and for stocks of the same price, ranked by volume. We could expand on this by using the % operator to provide a nicer-looking report format.
Note that we aren’t obligated to sort the sequence. We can use the sorted() function here, also.
for stock, price, date, time, change, opPrc, dHi, dLo, vol \
        in sorted( data, key=priceVolume ):
    print stock, price, date, time, change, vol
This does not update the data list, but is otherwise identical.
In languages like C or COBOL a “record” or “struct” will describe the contents of a file. The advantage of a record is that the fields have names instead of numeric positions. In Python, we can achieve the same level of clarity using a dict for each line in the file.
For this, we’ll download files from a web-based portfolio manager. This portfolio manager gives us stock information in a file called display.csv. Here is an example.
+/-,Ticker,Price,Price Change,Current Value,Links,# Shares,P/E,Purchase Price,
-0.0400,CAT,54.15,-0.04,2707.50,CAT,50,19,43.50,
-0.4700,DD,45.76,-0.47,2288.00,DD,50,23,42.80,
0.3000,EK,46.74,0.30,2337.00,EK,50,11,42.10,
-0.8600,GM,59.35,-0.86,2967.50,GM,50,16,53.90,
This file contains a header line that names the data columns, making processing considerably more reliable. We can use the column titles to create a dict for each line of data. By using each data line along with the column titles, we can make our program quite a bit more flexible. This shows a way of handling this kind of well-structured information.
invest= 0
current= 0
with open( "display.csv", "rU" ) as quotes:
    titles= quotes.next().strip().split( ',' )
    for q in quotes:
        values= q.strip().split( ',' )
        data= dict( zip(titles,values) )
        print data
        invest += float(data["Purchase Price"])*float(data["# Shares"])
        current += float(data["Price"])*float(data["# Shares"])
print invest, current, (current-invest)/invest
We open our portfolio file, display.csv, for reading, creating a file object named quotes.
The first line of input, quotes.next(), is the set of column titles. We strip any extraneous whitespace characters from this line, and then perform a split to create a list of individual column title strings. This list is assigned to the variable titles.
We also initialize two counters, invest and current to zero. These will accumulate our initial investment and the current value of this portfolio.
We use a for statement to iterate through the remaining lines in the quotes file. Each line is assigned to q.
We create a dict, data; the column titles in the titles list are the keys. The data fields from the current record, in values are used to fill this dict. The built-in zip() function is designed for precisely this situation: it interleaves values from each list to create a new list of tuples.
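In isolation, the zip()-to-dict step looks like this; the titles and values are a shortened, made-up version of the real file’s columns.

```python
titles = ["Ticker", "Price", "# Shares"]
values = ["CAT", "54.15", "50"]

# zip() pairs each title with the value in the same position;
# dict() turns those pairs into a mapping.
data = dict(zip(titles, values))
print(data["Price"])   # → 54.15
```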
Now, we have access to each piece of data using its proper column title. The number of shares is in the column titled "# Shares". We can find this information in data["# Shares"].
We perform some simple calculations on each dict. In this case, we convert the purchase price to a number, convert the number of shares to a number and multiply to determine how much we spent on this stock. We accumulate the sum of these products into invest.
We also convert the current price to a number and multiply this by the number of shares to get the current value of this stock. We accumulate the sum of these products into current.
When the loop has terminated, we can write out the two numbers, and compute the percent change.
File Structures. What is required to process variable length lines of data in an arbitrary (random) order? How is the application program to know where each line begins?
Device Structures. Some disk devices are organized into cylinders and tracks instead of blocks. A disk may have a number of parallel platters; a cylinder is the stack of tracks across the platters available without moving the read-write head. A track is the data on one circular section of a single disk platter. What advantages does this have? What (if any) complexity could this lead to? How does an application program specify the tracks and sectors to be used?
Some disk devices are described as a simple sequence of blocks, in no particular order. Each block has a unique numeric identifier. What advantages could this have?
Some disk devices can be partitioned into a number of “logical” devices. Each partition appears to be a separate device. What (if any) relevance does this have to file processing?
Portfolio Position. We can create a simple CSV file that contains a description of a block of stock. We’ll call this the portfolio file. If we have access to a spreadsheet, we can create a simple file with four columns: stock, shares, purchase date and purchase price. We can save this as a CSV file.
If we don’t have access to a spreadsheet, we can create this file in IDLE. Here’s an example line.
stock,shares,"Purchase Date","Purchase Price"
"AAPL", 100, "10/1/95", 14.50
"GE", 100, "3/5/02", 38.56
We can read this file, multiply shares by purchase price, and write a simple report showing our initial position in each stock.
Note that each line will be a simple string. When we split this string on the ,’s (using the string split() method) we get a list of strings. We’ll still need to convert the number of shares and the purchase price from strings to numbers in order to do the multiplication.
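One possible sketch of this exercise; it assumes the file above has been saved as portfolio.csv (the file name and variable names are our own), and is written in modern Python syntax.

```python
# Build the sample portfolio file described above.
with open("portfolio.csv", "w") as target:
    target.write('stock,shares,"Purchase Date","Purchase Price"\n')
    target.write('"AAPL", 100, "10/1/95", 14.50\n')
    target.write('"GE", 100, "3/5/02", 38.56\n')

positions = []
source = open("portfolio.csv", "r")
source.readline()                    # skip the column-title line
for line in source:
    fields = line.strip().split(",")
    if len(fields) != 4:
        continue                     # skip blank or malformed lines
    stock = fields[0].strip().strip('"')
    shares = float(fields[1])        # convert strings to numbers
    price = float(fields[3])
    positions.append((stock, shares * price))
source.close()
print(positions)
```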
Aggregated Portfolio Position. In Portfolio Position we read a file and did a simple computation on each row to get the purchase price. If we have multiple blocks of a given stock, these will be reported as separate lines of detail. We’d like to combine (or aggregate) any blocks of stock into an overall position.
Programmers familiar with COBOL (or RPG) or similar languages often use a Control-Break reporting design which sorts the data into order by the keys, then reads the lines of data looking for a break in the keys. This design uses very little memory, but is rather slow and complex.
It’s far simpler to use a Python dictionary than it is to use the Control-Break algorithm. Unless the number of distinct key values is vast (on the order of hundreds of thousands of values) most small computers will fit the entire summary in a simple dictionary.
A program which produces summaries, then, would have the following design pattern.
Some people like to see the aggregates sorted into order. This is a matter of using sorted() to iterate through the dictionary keys in the desired order to write the final report.
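The dictionary-based aggregation can be sketched as follows; the blocks of stock here are made-up values in the format of the portfolio file.

```python
# Blocks of stock as (stock, shares, price) tuples.
blocks = [
    ("AAPL", 100, 14.50),
    ("GE",    50, 38.56),
    ("AAPL",  25, 21.00),
]

# Accumulate total shares per stock in a dict keyed by symbol;
# dict.get() supplies 0 the first time a symbol is seen.
totals = {}
for stock, shares, price in blocks:
    totals[stock] = totals.get(stock, 0) + shares

# Report the aggregates in sorted order by stock symbol.
for stock in sorted(totals):
    print(stock, totals[stock])
```

No sorting of the input is needed to compute the totals; sorted() is used only to order the final report.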
Portfolio Value. In Reading a CSV File the Hard Way, we looked at a simple CSV-format file with stock symbols and prices. This file has the stock symbol and last price, which serves as a daily quote for this stock’s price. We’ll call this the stock-price file.
We can now compute the aggregate value for our portfolio by extracting prices from the stock price file and number of shares from the portfolio file.
If you’re familiar with SQL, this is called a join operation; most databases provide a number of algorithms to match rows between two tables. If you’re familiar with COBOL, this is often done by creating a lookup table, which is an in-memory array of values.
We’ll create a dictionary from the stock-price file. We can then read our portfolio, locate the price in our dictionary, and write our final report of current value of the portfolio. This leads to a program with the following design pattern.
Load the price mapping from the stock-price file.
In the case of a stock with no price, the program should produce a “no price quote” line in the output report. It should not produce a KeyError exception.
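The dict.get() method handles the missing-price case without raising KeyError; a sketch with made-up prices and portfolio rows:

```python
# A price mapping loaded from the stock-price file (values made up).
prices = {"AAPL": 20.44, "GE": 45.76}

# Portfolio rows: (stock, shares).
portfolio = [("AAPL", 100), ("XYZ", 50)]

for stock, shares in portfolio:
    price = prices.get(stock)    # None instead of a KeyError
    if price is None:
        print(stock, "no price quote")
    else:
        print(stock, shares * price)
```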