We’ll look at a quick example of processing with the csv module in The csv Module.
We’ll look at a number of typical file-processing design patterns in this section.
Throughout these examples, we’ll be dealing with stock and mutual fund information that we downloaded from the internet. You can find this kind of information on http://finance.yahoo.com.
Also, we are going to use simple floating-point numbers to represent dollar amounts. This will give us answers which are good enough for now. Later (in Fixed-Point Numbers : Doing High Finance with decimal), we’ll introduce the decimal module, which we can use to produce correct results.
After these examples, we’ll look at several file-handling modules in Additional File-Related Modules.
One common and useful file format is the Comma-Separated Values (CSV) format. CSV files use a , to separate values. If a value has a , in it, the value is quoted, usually with ". If a value has " in it, the " characters are doubled.
Here’s an example file.
"First","Middle","Last","Nickname","Email","Category" "Moe","","Howard","Moe","firstname.lastname@example.org","actor" "Jerome","Lester","Howard","Curly","email@example.com","actor" "Larry","","Fine","Larry","firstname.lastname@example.org","musician" "Jerome","","Besser","Joe","email@example.com","actor" "Joe","","DeRita","CurlyJoe","firstname.lastname@example.org","actor" "Shemp","","Howard","Shemp","email@example.com","actor"
We’ll look at reading CSV files using the csv module, described in chapter 9 of the Python Library Reference [PythonLib]. This module gives us a handy definition called a reader, which will extract individual records from the file, properly match up the " characters, and correctly split fields on the , characters.
The csv.reader() function creates an iterator object that both gets individual lines from the file and does all of the necessary decoding for us. We can use this CSV iterator with the for statement to correctly parse every line from the file.
When we read a CSV file, such as one saved by our spreadsheet software, we have to open it with a mode of "rb"; the csv module does its own handling of line endings.
import csv
naFile= file( "name_addr.csv", "rb" )
rdr= csv.reader( naFile )
for person in rdr:
    print person[0], person[2], person[4]
naFile.close()
When you run this program, you’ll notice that the header line in our file is being processed as if it were data. We’d like to skip past this gracefully. Since rdr is an iterator, we can use rdr.next() to get the first line from the file.
import csv
naFile= file( "name_addr.csv", "rb" )
rdr= csv.reader( naFile )
header= rdr.next()
for person in rdr:
    print person[0], person[2], person[4]
naFile.close()
Here’s another example that reads a CSV (Comma-Separated Values) file format. A popular stock quoting service on the Internet will provide CSV files with current stock quotes. Here’s an example of the file that we downloaded.
"^DJI",10623.64,"6/15/2001","4:09PM",-66.49,10680.81,10716.30,10566.55,N/A "AAPL",20.44,"6/15/2001","4:01PM",+0.56,20.10,20.75,19.35,8122800 "CAPBX",10.81,"6/15/2001","5:57PM",+0.01,N/A,N/A,N/A,N/A
The stock, date and time are quoted strings. The other fields are generally numbers, typically in dollars or percents with two digits of precision. There are a few exceptions to this format for indexes and mutual funds.
This is a very old example of the file. The prices of these stocks may have changed, but the file format hasn’t changed one bit.
The first line shows a quote for an index: the Dow-Jones Industrial average. The trading volume doesn’t apply to an index, so it is N/A, without quotes. The second line shows a regular stock (Apple Computer) that traded 8,122,800 shares on June 15, 2001. The third line shows a mutual fund. The detailed opening price, day’s high, day’s low and volume are not reported for mutual funds.
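Since some of the fields can be N/A, a program that does arithmetic on these quotes has to cope with the missing values. One possible approach (the helper function here is our own invention, not part of the quote service) converts a field to a number, treating N/A as None:

```python
# Convert one quote field to a number; N/A becomes None so that
# later arithmetic can skip the missing values.
def toNumber(field):
    if field == "N/A":
        return None
    return float(field)

print(toNumber("+0.56"))
print(toNumber("N/A"))
```

A program can then test for None before multiplying or summing a field.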
After looking at the results online, we clicked on the link to save the results as a CSV file. We called it quotes.csv. The following program will open and read the quotes.csv file after we download it from this service.
#!/usr/bin/env python
import csv
qFile= file( "quotes.csv", "r" )
qRdr= csv.reader( qFile )
for quote in qRdr:
    stock, price, dt, tm, chg, opn, dHi, dLo, vol = quote
    print stock, price, dt, tm, chg, vol
qFile.close()
This example shows a short way to read, sort and write a file.
We can easily sort data in a list, using the sort() method function. So, our solution must first read the data, creating a list. We can sort the list, then write the list in sorted order for processing by another program.
In this case, we’ll sort our stock quotes by company, the first field in each quote record. For simplicity we’ll write the sorted CSV file to sys.stdout. We’ll look at some extensions to this program to sort by different fields and write to a different output file.
#!/usr/bin/env python
import csv
qRdr= csv.reader( file( "quotes.csv", "r" ) )
data= [ tuple(quote) for quote in qRdr ]
def name(quote):
    return quote[0]
data.sort( key=name )
for q in data:
    print q[0], q[1], q[2], q[3]
Debugging CSV Input
One problem with file processing is that a file is simply a giant string of characters, while our Python program works with more structured data. Essentially, reading a file is a way of translating the characters into a useful Python structure.
The most common thing that can go wrong is not creating the expected structure in our Python program. In the Reading and Sorting example, we might not create our list of tuples correctly.
It is helpful to print the value of the data variable to get a good look at the data structure which is produced. Here we show the beginning of our “list of tuples”. We’ve adjusted the Python output to make it a little more readable.
[('"^DJI"', '10452.15', '"9/26/2005"', '"1:50pm"', '+32.56', '10420.22', '10509.23', '10420.22', '137206720'),
 ('"^IXIC"', '2121.07', '"9/26/2005"', '"1:50pm"', '+4.23', '2127.90', '2132.60', '2119.17', '0'),
 ...
Looking at the intermediate results helps us be sure that we are reading the file properly.
A more interesting modification is to add various function definitions for different sorts. For instance, if we wanted to sort by price (field 1), we could make the following change. We can define any number of functions and use one of them in the sort() method function.
def name(quote):
    return quote[0]
def price(quote):
    return float(quote[1])
data.sort( key=price )
Bonus Question. Why did we add the calls to the built-in function float()? What happens if we take those function calls out? What is the difference between comparing strings of digits and comparing numeric values? For review, see Sorting a List: Expanding on the Rules.
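As a hint, here’s a small sketch of the difference. Digit strings are compared character by character, so "10.81" sorts before "9.75"; a float() key restores numeric order:

```python
# Sorting strings of digits versus sorting numeric values.
prices = ["9.75", "10.81", "20.44"]
as_strings = sorted(prices)            # character-by-character order
as_numbers = sorted(prices, key=float) # numeric order
print(as_strings)
print(as_numbers)
```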
This example uses data that we downloaded from a web-based portfolio manager. This portfolio manager’s stock information comes in a file format that includes an extra header line with column titles in it. This file is called dwnld_portinfo.csv. Here is an example.
"",TICKER,"PRICE","PRICE CHANGE","# SHARES","P/E","PURCHASE PRICE","PURCHASE PRICE" "","CAT","58.34","-0.58","50","17.26","43.5","43.5" "","DD","38.63","-0.15","50","14.97","42.8","42.8" "","EK","25.81","0.3","50","0","42.1","42.1" "","GM","31.15","0.08","50","0","53.9","53.9" "","USD",--,--,--,--,--,-- "Totals:","","","","","","",""
This file contains a header line that names the data columns, making processing much more reliable. If the web site adds a field or changes the order of the fields, we can use this column title information to assure that our program doesn’t need to be changed.
We can use the column titles to create a dictionary for each line of data. By making a dictionary of each line, we can identify each piece of data by the column name, not by the position. Identifying data by column name is generally more clear. It’s also immune to changes in the column order.
This file has two lines of junk that we want to gracefully ignore. First, it has a trailing “USD” line, which shows the cash position of the portfolio. Second, it has a “Totals:” line which doesn’t contain any position data. We’ll need to discard these two lines.
#!/usr/bin/env python
import csv
pFile= file( "dwnld_portinfo-3.csv", "r" )
pDictRdr= csv.DictReader( pFile )
invest= 0
current= 0
for posn in pDictRdr:
    if posn[""] == "Totals:": continue
    if posn["TICKER"] == "USD": continue
    print posn
    invest += float(posn["PURCHASE PRICE"])*float(posn["# SHARES"])
    current += float(posn["PRICE"])*float(posn["# SHARES"])
pFile.close()
print invest, current, (current-invest)/invest
We open our portfolio position file for reading, creating an object named pFile.
We use our input file, pFile to create a csv.DictReader. This reader will do three things: it will match up " characters, split fields on , characters, and use the first line of the file as keys to create a dictionary.
Each row will be a dictionary. The key will be the column header, and the value will be this row’s data value.
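Here’s a small sketch of the dictionaries that csv.DictReader produces, using made-up data in a string rather than a file:

```python
import csv, io

# Two data rows under a header row; each row becomes a dictionary
# keyed by the column titles from the first line.
text = io.StringIO("TICKER,PRICE\nCAT,58.34\nDD,38.63\n")
rows = list(csv.DictReader(text))
print(rows[0]["TICKER"])
print(rows[1]["PRICE"])
```

Every value is still a string; the program must convert prices and share counts with float() before doing arithmetic.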
We also initialize two counters, invest and current to zero. These will accumulate our initial investment and the current value of this portfolio.
We use a for statement to iterate through the positions in the file. Each position will be a dictionary, assigned to the variable posn.
We can get each field’s value using the column title. For example, we get the ticker symbol using posn["TICKER"].
Our first piece of processing is a filter. The totals line has the value Totals: in the unnamed column. We’ll ignore the totals line at the end (posn[""] == "Totals:") by continuing the loop. The cash position has a ticker symbol of USD. We’ll ignore the cash position (posn["TICKER"] == "USD") by continuing the loop.
Our second piece of processing is some simple calculations. In this case, we convert the purchase price to a number, convert the number of shares to a number and multiply to determine how much we spent on this stock. We accumulate the sum of these products into invest.
We also convert the current price to a number and multiply this by the number of shares to get the current value of this stock. We accumulate the sum of these products into current.
When the loop has terminated, we can close the file, write out the two numbers, and compute the percent change.
The conversion to a dictionary makes our “business rules” relatively easy to read.
If we wanted to be really precise, we could say things like the following to separate the issue of identifying the cash position line from the processing for the cash position line. The boolean variable cashPosition is set to True when we identify the cash position line in the file.
cashPosition= posn["TICKER"] == "USD"
if cashPosition:
    continue
Additionally, we could make the processing more clear by expanding it into the following. We would separate the conversion from string to number from the calculation using that number.
shares= float(posn["# SHARES"])
purch= float(posn["PURCHASE PRICE"])
invest += shares * purch
There are a number of operations closely related to file processing. Deleting and renaming files are examples of operations that change the directory information that the operating system maintains to describe a file. Python provides modules for these operating system file management operations.
The Python Library Reference [PythonLib] has three chapters and over two dozen modules that are useful for working with files. We’ll highlight a few.
26. Python Runtime Services. There are a number of modules described in this section. We want to emphasize just one, sys.
The sys module provides access to some objects used or maintained by the interpreter and to functions that interact with the interpreter.
Most importantly, the sys module provides access to the three standard OS files used by Python.
11. File and Directory Access. This chapter describes a large number of very useful modules for working with files and directories.
The os.path module contains operating-system agnostic functions for managing path and directory names. Since these functions are tailored for each operating system, this is the best way to assure portability of your program.
The os.path module helps us parse and create correct file names. This module addresses the most obvious differences among operating systems: the way that files are named. In particular, the path separator can be either the POSIX standard /, or the Windows \. Additionally, there’s a MacOS Classic mode that can also use :. Rather than make each program aware of the operating system rules for path construction, Python provides the os.path module to make all of the common filename manipulations completely consistent.
A serious mistake is to use ordinary string functions with literals for the path separators. For example, a program using "\" as the separator will only work on Windows, and won’t work anywhere else. A less serious mistake is to assemble names manually using os.sep. The best approach is to use the functions in the os.path module.
The os.path module contains the following functions for completely portable path and filename manipulation.
os.path.basename(path)
Return the base filename, the second half of the result created by os.path.split( path ).
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.basename( fn )
'portfolio.py'
os.path.dirname(path)
Return the directory name, the first half of the result created by os.path.split( path ).
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.dirname(fn)
'/Users/slott/Documents/Writing/NonProg2.5/notes'
os.path.getatime(path)
Return the last access time of a file, reported by os.stat(). See the time module for functions to process the time value.
>>> import os
>>> import time
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.getatime( fn )
1246637163.0
>>> time.ctime(_)
'Fri Jul  3 12:06:03 2009'
os.path.getsize(path)
Return the size of a file, in bytes, reported by os.stat().
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.getsize( fn )
175L
os.path.join(path, ...)
Join path components using the appropriate path separator. This is the best way to assemble long path names from component pieces. It is operating-system independent, and understands all of the operating system’s punctuation rules.
>>> import os
>>> os.path.join( '/Users', 'slott', 'Documents', 'Writing' )
'/Users/slott/Documents/Writing'
os.path.split(path)
Split a pathname into two parts: the directory and the basename (the filename, without path separators, in that directory). The result (s, t) is such that os.path.join( s, t ) yields the original path.
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> os.path.split( fn )
('/Users/slott/Documents/Writing/NonProg2.5/notes', 'portfolio.py')
os.path.splitext(path)
Split a path into root and extension. The extension is everything starting at the last dot in the last component of the pathname; the root is everything before that. The result tuple ( root , ext ) is such that root + ext yields the original path.
>>> import os
>>> fn='/Users/slott/Documents/Writing/NonProg2.5/notes/portfolio.py'
>>> dir, file = os.path.split(fn)
>>> os.path.splitext( file )
('portfolio', '.py')
The following example is typical of the manipulations done with os.path:
import sys, os.path

def process( oldName, newName ):
    pass # some processing...

for oldFile in sys.argv[1:]:
    dir, fileext= os.path.split(oldFile)
    file, ext= os.path.splitext( fileext )
    if ext.upper() == '.HTML':
        newFile= os.path.join( dir, file ) + '.BAK'
        print oldFile, newFile
        process( oldFile, newFile )
This program imports the sys and os.path modules. The variable oldFile is set to each file name that is listed in the sequence sys.argv by the for statement.
Each file name is split into the path name and the base name. The base name is further split to separate the file name from the extension. The os.path module does this correctly for all operating systems, saving us from having to write platform-specific code. For example, splitext() correctly handles the situation where a Linux file has multiple '.'s in the file name.
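We can check the multiple-'.' behavior directly; splitext() splits at the last dot only:

```python
import os.path

# Only the last extension is separated; earlier dots stay in the root.
print(os.path.splitext('archive.tar.gz'))
```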
The extension is tested to be '.HTML'. The processing only applies to these files. A new file name is joined from the path, base name and a new extension ('.BAK'). The old and new file names are printed and some processing, defined in the process(), uses the oldFile and newFile names.
One common problem is to open a unique temporary file to hold intermediate results; Python supports this with the tempfile module. The tempfile module includes a function, mkstemp(), which creates a uniquely named temporary file. Temporary files must be explicitly deleted with os.remove().
When writing web applications – where your server is executing multiple, concurrent transactions – you’ll need this function to create temporary files that are private to each web user. For ordinary single-user applications that run on a desktop PC, this module isn’t often necessary.
tempfile.mkstemp([suffix[, prefix[, dir[, text]]]])
Create a secure temporary file. If suffix is specified, this is the end of the name. If you want this to be the extension, like .tmp, you must include the .. If prefix is specified, this is the beginning of the file name. If dir is specified, this is the directory in which the file will be created; otherwise a suitable default directory is used, based on the tempfile.tempdir variable, environment variables and platform-specific alternative locations. The text argument determines whether the file is opened in text or binary mode.
The tuple returned is ( fd, name ): an operating system file descriptor number and the string filename. The file descriptor can be used with os.fdopen() to create a proper Python file object. The filename can be used with os.unlink() to remove the temporary file when you are done with it.
Here’s an example of how we can use tempfile.mkstemp() to create a file. We’ll use this file to store some intermediate results. When the program is done, we’ll remove the file.
import tempfile, os
tempFD, tempName= tempfile.mkstemp( '.tmp' )
temp= os.fdopen( tempFD, 'w+' )
# some processing...
temp.close()
os.unlink( tempName )
This fragment will create a unique temporary file name with an extension of .tmp. Since the name is guaranteed to be unique, this can be used without fear of damaging or overwriting any other file. After the processing, the file is removed.
Here’s an example of some processing we might do within this framework. This fragment writes 100 random dice rolls to the file and then reads those 100 random dice rolls and averages them.
import random

# write to the temp file
for i in range(100):
    temp.write( "%d %d\n" % (random.randrange(1,7),random.randrange(1,7)) )
temp.flush()
# rewind the temp file
temp.seek( 0 )
# read from the temp file
sum= 0
for dice in temp:
    d1, d2 = map( int, dice.split() )
    sum += d1+d2
print sum/100.0
The shutil module automates copying entire files or directories. This saves the steps of opening, reading, writing and closing files when there is no actual processing, simply moving files.
When we have complex programs that need to preserve a backup copy of a file or rename a file, we have two choices for our design: we can open, read and write the files ourselves, or we can let shutil do the work with functions like shutil.copy().
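Here’s a small sketch of the shutil approach, making a backup copy of a file. The file names are invented, and we use a scratch directory from tempfile so the example is self-contained:

```python
import os, shutil, tempfile

# Create a small file in a scratch directory, then copy it to a
# backup name with a single shutil.copy() call.
scratch = tempfile.mkdtemp()
src = os.path.join(scratch, "quotes.csv")
with open(src, "w") as f:
    f.write('"AAPL",20.44\n')
dst = os.path.join(scratch, "quotes.bak")
shutil.copy(src, dst)
with open(dst) as f:
    print(f.read())
```

No explicit read-write loop is needed; shutil handles the buffering for us.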
The GNU/Linux shell expands wild-cards to complete lists of file names; the verb is to glob (really). The glob module makes the name globbing capability available to Windows programmers. The glob module includes the following function that locates all names which match a given pattern.
glob.glob(pattern)
Return a list of filenames that match the given wild-card pattern. The fnmatch module is used for the wild-card pattern matching.
A common use for glob is something like the following.
import glob, sys
for wildcard in sys.argv[1:]:
    for f in glob.glob(wildcard):
        process( f )
This can make Windows programs process command-line arguments somewhat like Unix programs. Each argument is passed to glob.glob() to expand any patterns into a list of matching files. If the argument is not a wild-card pattern, glob simply returns a list containing that one file name, provided the file actually exists; glob returns an empty list when nothing matches.
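Here’s a sketch of glob in action. We create a scratch directory with a few files (the names are invented) and match against *.py:

```python
import glob, os, tempfile

# Build a scratch directory holding three files, then glob for *.py.
scratch = tempfile.mkdtemp()
for name in ("a.py", "b.py", "notes.txt"):
    open(os.path.join(scratch, name), "w").close()
matches = sorted(
    os.path.basename(m)
    for m in glob.glob(os.path.join(scratch, "*.py")) )
print(matches)
```

Only the two .py files match; notes.txt is left out of the result.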
The fnmatch module has the essential algorithm for matching a wild-card pattern against file names. This module implements the Unix shell wild-card rules. These rules are used by glob to locate all files that match a given pattern. The module contains the following function:
fnmatch.fnmatch(filename, pattern)
Return True if the filename string matches the pattern string.
The patterns use * to match any number of characters and ? to match any single character. [letters] matches any one of the given letters, and [!letters] matches any character that is not in the given set of letters.
>>> import fnmatch
>>> fnmatch.fnmatch('greppy.py','*.py')
True
>>> fnmatch.fnmatch('README','*.py')
False
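The character-class patterns work the same way; [0-9] matches exactly one digit, and [!0-9] matches exactly one character that is not a digit:

```python
import fnmatch

# One digit after "data" matches [0-9]; a letter matches [!0-9].
print(fnmatch.fnmatch('data1.txt', 'data[0-9].txt'))
print(fnmatch.fnmatch('dataX.txt', 'data[!0-9].txt'))
```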
14. Generic OS Services. This chapter describes a number of modules that are specifically designed to be the same in Linux, Mac OS and Windows. By using these modules, you can be assured that your Python program will work the same everywhere.
The os module contains an interface to many operating system-specific functions that manipulate processes, files, file descriptors, directories and other operating system resources. This module is specific to the operating system. Programs that import and use os stand a better chance of being portable between different platforms. Portable programs must depend only on functions that are supported for all platforms (e.g., unlink() and opendir()), and leave all pathname manipulation to os.path.
The os module exports a number of things. These constants are like variables, but changing their value will not have any beneficial effects on your program. The following definitions in this module provide useful information about the operating system.
os.sep
The (or the most common) pathname separator character ( / generally, \ on Windows). Most of the Python library routines will translate the standard / for use on Windows.
It is better to use the os.path module to construct or parse path names.
os.chdir(path)
Change the current working directory to path.
import os
os.chdir( "/Volumes/Slott02/Writing/Tech/PFNP/Notes" )
os.getcwd()
Return the current working directory path.
import os
print os.getcwd()
Source Lines Of Code.
One measure of the complexity of an application is the count of the number of lines of source code. Often, this count discards comment lines. We’ll write an application to read Python source files, discarding blank lines and lines beginning with #, and producing a count of source lines.
We’ll develop a function to process a single file and count the lines of code. Once we can process a single file, we can then use the glob module to locate all of the *.py files in a given directory and process each file.
Develop a fileLineCount(name) function which opens a file with the given name and examines all of the lines of the file. Each line should have strip() applied to remove leading and trailing spaces. If the resulting line is of length zero, it was effectively blank, and can be skipped. If the resulting line begins with # the line is entirely a comment, and can be skipped. All remaining lines should be counted, and fileLineCount(name) returns this count.
Develop a directoryLineCount(path) function which uses the path with glob.glob() to expand all matching file names. Each file name is processed with fileLineCount(name) to get the number of non-comment source lines. Write the results to a tab-delimited file; each line should have the form "filename \t lines".
For a sample application to analyze, look in your Python distribution for Lib/idlelib/*.py.
Summarize a Tab-Delimited File.
The previous exercise produced a file where each line has the form "filename \t lines". Read this file, producing a nicer-looking report that has column titles, file and line counts, and a total line count at the end of the report.
You should make liberal use of the string % operator for formatting the output.
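For example, a format like the following (the column widths are arbitrary choices) pads the file name on the right and right-justifies the count:

```python
# Left-justify the name in 20 columns, right-justify the count in 8.
line = "%-20s %8d" % ("portfolio.py", 42)
print(line)
```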
File Processing Pipeline.
The previous two exercises produced programs which can be part of a processing pipeline. The first exercise should produce its output on sys.stdout. The second exercise should gather its input from sys.stdin. Once this capability is in place, the pipeline can be invoked using a command like the following:
$ python lineCounter.py path | python lineSummary.py
This is an important “fit and finish” issue for GNU/Linux programs. A well-behaved program can use sys to get argument values so that the names of files or directories are not “hard-coded” into the program. Additionally, we should always use sys.stdout and sys.stdin to make it easy to reuse programs.