All programs must deal with external data. They will either accept data from sources outside the text of the program, or they will produce some kind of output, or they will do both. Think about it: if the program produces no output, how do you know it did anything?
By external data, we mean data outside of volatile, high-speed, primary memory; we mean data on peripheral devices. This may be persistent data on a disk, or transient data on a network interface. For now, it may mean transient data displayed on our terminal.
Most operating systems provide simple, uniform access to external data via the abstraction called a file. We’ll look at the operating system implementation, as well as the Python class that gives us access to the operating system file in our programs.
In File Objects – Our Connection To The File System, we provide definitions of how Python works with files. We cover the built-in functions for working with files in The File and Open Functions. In Methods We Use on File Objects, we describe some method functions of file objects. We’ll look at file-processing statements in File Statements: Reading and Writing (but no Arithmetic).
Abstractions Built on Top of Abstractions. Files do a huge number of things for us. To support this broad spectrum of capabilities, there are two layers of abstraction involved: the OS and Python. Unfortunately, both layers use the same words, so we have to be careful about casually misusing the word “file”.
The operating system has devices of various kinds. All of the various devices are unified using a common abstraction that we call the file system. All of a computer’s devices appear as OS files of one kind or another. Some things which aren’t physical devices also appear as files. Files are the plumbing that move data around our information infrastructure.
Additionally, Python defines file objects. These file objects are the fixtures that give our Python program access to OS files.
The following figure shows this technology stack. Your program makes use of Python File objects. Python, in turn, makes use of OS file objects. Yes, it can be confusing that “file” is used for both things. However, you only have to focus on the Python file; the rest is just infrastructure to support your needs.
How Files Work. When your program evaluates a method function of a Python file object, Python transforms this into an operation on the underlying OS file. An OS file operation becomes an operation on one of the various kinds of devices attached to our computer. Or, an OS file operation can become a network operation that reaches through the Internet to access data from remote computers. The two layers of abstraction mean that one Python program can do a wide variety of things on a wide variety of devices.
In Python, we create a file object to work with files in the file system. In addition to files in the OS’s file system, Python recognizes a spectrum of file-like objects, including abstractions for network interfaces called pipes and sockets and even some kinds of in-memory buffers.
Unlike sequences, sets and mappings, there are no Python literals for file objects. Lacking literals, we create a file object using the file() or open() factory function. We provide two pieces of information to this function. We can provide a third, optional, piece of information that may improve the performance of our program.
The name of the file. The operating system will interpret this name using its “working directory” rules. If the name starts with / (or device:\) it’s an absolute name. Otherwise, it’s a relative name; the current working directory plus this name identifies the file.
Python can translate standard paths (using /) to Windows-specific paths. This saves us from having to really understand the differences. We can name all of our files using /, and avoid the messy details.
We can, if we want, use raw strings to specify Windows path names using the \ character.
The access mode for the file. This is some combination of read, write and append. The mode can also include instructions for interpreting the bytes as characters.
Optionally, we can include the buffering for the file. Generally, we omit this. If the buffering argument is given, 0 means each byte is transferred as it is read or written. A value of 1 means the data is buffered a line at a time, suitable for reading from a console, or writing to an error log. Larger numbers specify the buffer size: numbers over 4,096 may speed up your program.
Once we create the file object, we can do operations to read characters from the file or write characters to the file. We can read individual characters or whole lines. Similarly, we can write individual characters or whole lines.
When Python reads a file as a sequence of lines, each line will become a separate string. The '\n' character is preserved at the end of the string. This extra character can be removed from the string using the rstrip() method function.
A file object (like a sequence) can create an iterator which will yield the individual lines of the file. You can, consequently, use the file object in a for statement. This makes reading text files very simple.
When the work is finished, we also need to use the file’s close() method. This empties the in-memory buffers and releases the connection with the operating system file. In the case of a socket connection, this will release all of the resources used to assure that data travels through the Internet successfully.
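The whole life cycle described above (create the file object, write or read, then close) can be sketched in a few lines. This is only an illustration: the scratch directory and the file name stooges.txt are hypothetical, chosen so the sketch does not touch any real data.

```python
from __future__ import print_function
import os
import tempfile

# Hypothetical scratch file, so the sketch does not touch real data.
work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "stooges.txt")

# Create a file object for writing: the name first, then the mode.
target = open(path, "w")
target.write("Moe\n")
target.write("Larry\n")
target.write("Curly\n")
target.close()  # empties the buffer and releases the OS file

# Create a second file object for reading the same OS file.
source = open(path, "r")
names = []
for line in source:              # the file object yields one line at a time
    names.append(line.rstrip())  # drop the trailing '\n'
source.close()

print(names)

# Remove the scratch file and directory.
os.remove(path)
os.rmdir(work_dir)
```

Each line arrives with its newline still attached, which is why the loop calls rstrip() before saving the name.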
The filename is the name of the file. This is simply given to the operating system. The OS expects either absolute or relative paths; the operating system folds in the current working directory to relative paths.
The mode is covered in detail below. It can be 'r', 'w' or 'a' for reading (default), writing or appending. If the file doesn’t exist when opened for writing or appending, it will be created. If a file existed when opened for writing, it will be truncated and overwritten. Add a 'b' to the mode for binary files. Add a '+' to the mode to allow simultaneous reading and writing.
If the buffering argument is given, 0 means unbuffered, 1 means line buffered, and larger numbers specify the buffer size.
Python expects the POSIX standard punctuation of / to separate elements of the filename path for all operating systems. If necessary, Python will translate these standard name strings to the Windows punctuation of \. Using standardized punctuation makes your program portable to all operating systems. The os.path module has functions for creating valid names in a way that works on all operating systems.
Constructing File Names
When using Windows-specific punctuation for filenames, you’ll have problems because Python interprets the \ as an escape character. To create a string with a Windows filename, you’ll either need to use \\ in the string, or use an r" " string literal. For example, you can use either of the following: r"E:\writing\technical\pythonbook\python.html" or "E:\\writing\\technical\\pythonbook\\python.html".
Note that you can often use "E:/writing/technical/pythonbook/python.html". This uses the POSIX standard punctuation for file paths, /, and is the most portable. Python generally translates standard file names to Windows file names for you.
Generally, you should either use standard names (using /) or use the os.path module to construct filenames. This module eliminates the need to use any specific punctuation. The os.path.join() function makes properly punctuated filenames from a series of strings.
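Here is a small sketch of the two ideas above: the raw-string and doubled-backslash spellings produce the same string, and os.path.join() supplies the punctuation for us. The list named parts is just an illustration built from the example path.

```python
from __future__ import print_function
import os.path

# Two spellings of the same Windows name: raw string vs. doubled backslashes.
raw_name = r"E:\writing\technical\pythonbook\python.html"
escaped_name = "E:\\writing\\technical\\pythonbook\\python.html"
same_string = (raw_name == escaped_name)

# os.path.join() inserts the separator that is correct for this OS.
parts = ["writing", "technical", "pythonbook", "python.html"]
joined = os.path.join(*parts)

# os.sep is that separator: '/' on POSIX systems, '\\' on Windows.
rebuilt = os.sep.join(parts)

print(same_string, joined, joined == rebuilt)
```

Because the separator comes from the os module, the same program produces a valid relative path on any operating system.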
The Mode String. The mode string specifies how the OS file will be accessed by your program. There are four separate issues addressed by the mode string: opening, bytes, newlines and operations.
Opening. For the opening part of the mode string, there are three alternatives:
|r:||Open for reading. Start at the beginning of the OS file. If the OS file does not exist, raise an IOError exception. This is the default.|
|w:||Open for writing. Start at the beginning of the OS file. If the OS file does not exist, create the OS file.|
|a:||Open for appending. Start at the end of the OS file. If the OS file does not exist, create the OS file.|
Bytes or Characters. For the byte handling part of the mode string, there are two alternatives:
|b:||The OS file is a sequence of bytes; do not interpret the file as a sequence of characters. This is suitable for .csv files as well as images, movies, sound samples, etc.|
The default, if b is not included, is to interpret the file as a sequence of ordinary characters. The Python file object will be an iterator that yields each individual line from the OS file as a separate string. Translations from various encoding schemes like UTF-8 and UTF-16 will be handled automatically.
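The difference between bytes and characters shows up as soon as a file holds a non-ASCII character. The sketch below is an assumption-laden illustration: it uses io.open() (the same function as the built-in open() in Python 3) and names the encoding explicitly with an encoding argument rather than relying on any default. The file name cafe.txt is hypothetical.

```python
from __future__ import print_function
import io
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "cafe.txt")

# Four characters; the last one needs two bytes in UTF-8.
text = u"caf\u00e9"

# Write the encoded bytes in binary mode.
target = io.open(path, "wb")
target.write(text.encode("utf-8"))
target.close()

# Binary mode: we get the raw bytes back, uninterpreted.
source = io.open(path, "rb")
raw = source.read()
source.close()

# Text mode: the bytes are decoded into characters.
source = io.open(path, "r", encoding="utf-8")
chars = source.read()
source.close()

print(len(raw), len(chars))

os.remove(path)
os.rmdir(work_dir)
```

The same OS file yields five bytes in binary mode but four characters in text mode.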
Universal Newlines. The newline part of the mode string has two alternatives:
|U:||Universal newline interpretation. The first instance of \n, \r\n (or \r) will define the newline character(s). Any of these three newline sequences will be silently translated to the standard '\n' character. The \r\n is a Windows feature.|
The default, if U is not included, is to only handle this operating system’s standard newline character(s).
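The translation of newline sequences can be seen with a file that mixes all three conventions. Note the substitution: instead of the U mode flag described here, this sketch uses io.open() with newline=None, which requests the same universal newline translation and is accepted by recent Python 3 releases that no longer take U. The file name is hypothetical.

```python
from __future__ import print_function
import io
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "mixed_newlines.txt")

# Write three lines, each ending with a different newline convention.
target = io.open(path, "wb")
target.write(b"old mac\rwindows\r\nunix\n")
target.close()

# newline=None asks for universal newline translation on input.
source = io.open(path, "r", encoding="ascii", newline=None)
lines = source.readlines()
source.close()

print(lines)

os.remove(path)
os.rmdir(work_dir)
```

Every line comes back ending in the standard '\n', whatever sequence was stored in the OS file.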
Mixed Operations. For the additional operations part of the mode string, there are two alternatives:
|+:||Allow both read and write operations to the OS file.|
The default, if + is not included, is to allow only limited operations: only reads for files opened with “r”; only writes for OS files opened with “w” or “a”.
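A short experiment makes the opening rules concrete. This is a sketch on a hypothetical scratch file, modes.txt; it shows that w truncates, a preserves earlier content, and r+ permits both reads and writes.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "modes.txt")

# "w" creates the file (or truncates an existing one).
f = open(path, "w")
f.write("first\n")
f.close()

# "a" starts at the end, so the earlier content survives.
f = open(path, "a")
f.write("second\n")
f.close()

f = open(path, "r")
after_append = f.read()
f.close()

# "w" again: the old content is discarded.
f = open(path, "w")
f.write("third\n")
f.close()

# "r+" allows both reads and writes on an existing file.
f = open(path, "r+")
after_truncate = f.read()  # read what is there
f.seek(0, 2)               # position at the end before switching to writing
f.write("fourth\n")
f.close()

f = open(path, "r")
final = f.read()
f.close()

os.remove(path)
os.rmdir(work_dir)
```

After the append the file holds both lines; after the second w only the newest line remains.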
Typical combinations include the following:
The following examples create Python file objects for further processing:
    dataSource= open( "name_addr.csv", "rb" )
    newPage= open( "addressbook.html", "w" )
    theErrors= open( "/usr/local/log/error.log", "a" )
This example opens the existing file name_addr.csv in the current working directory for reading. The variable dataSource identifies this file object, and we can use this variable for reading strings from this file.
This file is opened in binary mode.
This example creates a new file addressbook.html (or it will truncate this file if it exists). The file will be in the current working directory. The variable newPage identifies the file object. We can then use this variable to write strings to the file.
This example appends to the file error.log (or creates a new file, if the file doesn’t exist). The file has the directory path /usr/local/log/. Since this is an absolute name, it doesn’t depend on the current working directory.
Buffering files is typically left as a default, specifying nothing. However, for some situations, adjusting the buffering can improve performance. Error logs, for instance, are often unbuffered, so the data is available immediately. Large input files may be opened with large buffer numbers to encourage the operating system to optimize input operations by reading a few large chunks of data from the device instead of a large number of smaller chunks.
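The buffering argument is the third positional argument to open(). The following sketch assumes a hypothetical log file and assumes that line buffering behaves on your platform as described above; it opens a log with line buffering and an input file with a large buffer.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
log_path = os.path.join(work_dir, "error.log")

# Third argument 1: line buffered, so each complete line is pushed out promptly.
log = open(log_path, "a", 1)
log.write("something went wrong\n")

# A second, independent file object can see the line already,
# even though the log has not been closed yet.
reader = open(log_path, "r", 65536)  # a large buffer for input
seen_before_close = reader.read()
reader.close()

log.close()

os.remove(log_path)
os.rmdir(work_dir)
```

With a fully buffered log, the reader might see nothing until the writer flushed or closed the file.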
There are a number of things that can go wrong in attempting to create a file object.
If the file name is invalid, you will get operating system errors. Usually they will look like this:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 2] No such file or directory: 'wakawaka'
It is very important to get the file’s path completely correct. You’ll notice that each time you start IDLE, it thinks the current working directory is something like C:\Python26. You’re probably doing your work in a different default directory.
When you open a module file in IDLE, you’ll notice that IDLE changes the current working directory to the directory that contains your module. If you have your .py files and your data files all in one directory, you’ll find that things work out well.
The next most common error is to have the wrong permissions. This usually means trying to write to a file you don’t own, or attempting to create a file in a directory where you don’t have write permission. If you are using a server, or a computer owned by a corporation, this may require some work with your system administrators to sort out what you want to do and how you can accomplish it without compromising security.
The [Errno 2] note in the error message is a reference to the internal operating system error numbers. There are over 100 of these error numbers, all collected into the module named errno. There are a lot of different things that can go wrong, many of which are very, very obscure situations.
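A program can examine the error number instead of just crashing. This sketch tries to open a file that cannot exist (a hypothetical name inside a freshly created, empty directory) and looks the number up in the errno module.

```python
from __future__ import print_function
import errno
import os
import tempfile

# A hypothetical name inside a fresh, empty directory, so it cannot exist.
work_dir = tempfile.mkdtemp()
missing_name = os.path.join(work_dir, "wakawaka")

try:
    source = open(missing_name, "r")
except IOError as problem:
    # problem.errno is the number shown in brackets in the message.
    error_number = problem.errno
    # errno.errorcode maps the number back to its symbolic name.
    error_name = errno.errorcode[error_number]
    print("Cannot open", missing_name, "-", error_name)
else:
    error_number = None
    error_name = None
    source.close()

os.rmdir(work_dir)
```

Comparing against a symbolic name such as errno.ENOENT reads better than comparing against a bare number.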
The Python file object is our view of the underlying operating system file. The OS file, in turn, gives us access to a specific device.
Reading. The following read methods get data from the OS file. These operations may also change the Python file object’s internal status and buffers. For example, at end-of-file, the internal status of the file object will be changed. Most importantly, these methods have the very visible effect of consuming data from the OS file.
f.read([size])
Read as many as size characters from file f as a single, large string. If size is negative or omitted, the rest of the file is read into a single string.
    from __future__ import print_function
    dataSource= open( "name_addr.csv", "r" )
    theData= dataSource.read()
    for n in theData.splitlines():
        print(n)
    dataSource.close()
f.readline([size])
Read the next line or as many as size characters from file f; an incomplete line can be read. If size is negative or omitted, the next complete line is read. If a complete line is read, it includes the trailing newline character. If the file is at the end, f.readline() returns a zero length string. If the file has a blank line, this will be a string of length 1, just the newline character.
    from __future__ import print_function
    dataSource= open( "name_addr.csv", "r" )
    n= dataSource.readline()
    while len(n) > 0:
        print(n.rstrip())
        n= dataSource.readline()
    dataSource.close()
f.readlines([hint])
Read the remaining lines of file f into a list of strings. If hint is given, read only about as many lines as fit in the next hint characters; the hint size may be rounded up to match an internal buffer size. If hint is negative or omitted, the rest of the file is read. All lines will include the trailing newline character. If the file is at the end, f.readlines() returns a zero length list.
When we simply reference a file object in a for statement, the file’s own iterator supplies the lines, one at a time.
    dataSource= open( "name_addr.csv", "r" )
    for n in dataSource:
        print(n.rstrip())
    dataSource.close()
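The read methods can be compared side by side on one small file. The following is a sketch using a hypothetical three-line file whose middle line is blank, so that the blank-line and end-of-file cases are both visible.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "names.txt")

# Three lines; the middle one is blank.
target = open(path, "w")
target.write("Moe\n\nLarry\n")
target.close()

source = open(path, "r")
first_two = source.read(2)        # at most 2 characters
rest_of_line = source.readline()  # the remainder of the first line
blank = source.readline()         # a blank line is just the newline
remaining = source.readlines()    # every remaining line, as a list
at_end = source.readline()        # zero length string: end of file
source.close()

os.remove(path)
os.rmdir(work_dir)
```

Each call consumes data, so the next call picks up where the previous one stopped.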
Writing. The following methods send data to the OS file. These operations may also change the Python file object’s internal status and buffers. Most importantly, these methods have the very visible effect of producing data to the OS file.
f.flush()
Flush all accumulated data from the internal buffers of file f to the device or interface. If a file is buffered, this can help to force writing of a buffer that is less than completely full. This is appropriate for log files, prompts written to sys.stdout and error messages.
f.truncate([size])
Truncate file f. If size is not given, the file is truncated at the current position. If size is given, the file will be truncated at or before size. This function is not available on all platforms.
    newPage= open( "addressbook.html", "w" )
    newPage.write( "<html>\n<head><title>Hello World</title></head>\n<body>\n" )
    newPage.write( "<p>Hello World</p>\n" )
    # the closing tag is </body>; "<\body>" would embed a backspace character
    newPage.write( "</body>\n</html>\n" )
    newPage.close()
    newPage= open( "addressbook.html", "w" )
    newPage.writelines( [ "<html>\n", "<head><title>Hello World</title></head>\n", "<body>\n" ] )
    newPage.writelines( ["<p>Hello World</p>\n" ] )
    # the closing tag is </body>; "<\body>" would embed a backspace character
    newPage.writelines( [ "</body>\n", "</html>\n" ] )
    newPage.close()
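Two details of the writing methods are easy to miss: writelines() adds no newline characters of its own, and flush() makes buffered data visible without closing the file. Here is a sketch on a hypothetical scratch file that shows both.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "page.html")

page = open(path, "w")
page.write("<p>one</p>\n")
# writelines() adds no newlines; each string must carry its own '\n'.
page.writelines(["<p>two</p>", "<p>three</p>\n"])
page.flush()  # push the buffered data to the OS file without closing

# A second file object can read everything written so far.
check = open(path, "r")
visible_after_flush = check.read()
check.close()

page.close()

os.remove(path)
os.rmdir(work_dir)
```

The second and third paragraphs end up on one line because the first string passed to writelines() had no trailing newline.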
Accessors. The following file accessors provide information about the file object.
f.tell()
Return the position from which file f will be processed. This is a partner to the seek() method; any position returned by the tell() method can be used as an argument to the seek() method to restore the file to that position.
f.fileno()
Return the internal file descriptor (fd) number used by the OS library when working with file f. A number of modules provide access to these low-level libraries for advanced operations on devices and files.
f.isatty()
Return True if file f is connected to an OS file that is a console or keyboard.
f.closed
This attribute of file f is True if the file is closed.
f.mode
This attribute is the mode argument to the file() function that was used to create the file object.
f.name
This attribute of file f is the filename argument to the file() function that was used to create the file object.
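The accessors are easiest to understand by asking a file object about itself. This sketch opens a hypothetical scratch file for writing and collects what each accessor reports.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "scratch.txt")

f = open(path, "w")
f.write("hello")

position = f.tell()       # where the next write will happen
descriptor = f.fileno()   # the small integer the OS uses for this file
on_terminal = f.isatty()  # False: this is a disk file, not a console
file_name = f.name        # the name we gave to open()
file_mode = f.mode        # the mode we gave to open()
was_closed = f.closed     # False while the file is open

f.close()
closed_after = f.closed   # True once close() has run

os.remove(path)
os.rmdir(work_dir)
```

The attributes name, mode and closed remain available even after the file has been closed.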
Transformers. The following file transforms change the file object itself. This includes closing it (and releasing all OS resources) or changing the position at which reading or writing happens.
f.close()
Close file f. The closed flag is set. Any further operations (except a redundant close) raise a ValueError exception.
f.seek(offset[, whence])
Change the position from which file f will be processed. There are three values for whence, which determine the starting point from which offset is measured.
If whence is 0 (the default), move to the absolute position given by offset. f.seek(0) will rewind file f.
If whence is 1, move relative to the current position by offset bytes. If offset is negative, move backwards; otherwise move forward.
If whence is 2, move relative to the end of file. f.seek(0,2) will advance file f to the end.
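The three whence values can be tried out on a file of ten known bytes. This sketch opens the file in binary mode, which is an assumption made for safety: in Python 3, text-mode files restrict seeks that are relative to the current position or to the end. The file name is hypothetical.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "digits.bin")

target = open(path, "wb")
target.write(b"0123456789")
target.close()

f = open(path, "rb")
f.seek(4)                 # whence 0: absolute position 4
from_start = f.read(2)    # reads positions 4 and 5; now at 6
f.seek(-3, 1)             # whence 1: back 3 from the current position
from_current = f.read(1)  # reads position 3
f.seek(-2, 2)             # whence 2: 2 bytes before the end
from_end = f.read()       # reads positions 8 and 9
f.seek(0)                 # rewind
start_position = f.tell()
f.seek(0, 2)              # advance to the end
end_position = f.tell()
f.close()

os.remove(path)
os.rmdir(work_dir)
```

The value returned by tell() after seeking to the end is also a convenient way to learn the size of the file.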
A file object (like a sequence) can create an iterator which will yield the individual lines of the file. We looked at how sequences work with the for statement in Looping Back : Iterators, the for statement and Generators. Here, we’ll use the file object in a for statement to read all of the lines.
Additionally, the print statement can make use of a file other than standard output as a destination for the printed characters. This will change with Python 3.0, so we won’t emphasize this.
Opening and Reading From a File. Let’s say we have the following file. If you use an email service like HotMail, Yahoo! or Google, you can download an address book in Comma-Separated Values ( CSV ) format that will look similar to this file. Yahoo!’s format will have many more columns than this example.
    "First","Middle","Last","Nickname","Email","Category"
    "Moe","","Howard","Moe","firstname.lastname@example.org","actor"
    "Jerome","Lester","Howard","Curly","email@example.com","actor"
    "Larry","","Fine","Larry","firstname.lastname@example.org","musician"
    "Jerome","","Besser","Joe","email@example.com","actor"
    "Joe","","DeRita","CurlyJoe","firstname.lastname@example.org","actor"
    "Shemp","","Howard","Shemp","email@example.com","actor"
Here’s a quick example that shows one way to read this file using the file’s iterator. This isn’t the best way; that will have to wait for The csv Module.
    dataSource = open( "name_addr.csv", "r" )
    for addr in dataSource:
        print(addr)
    dataSource.close()
A More Complete Reader. Here’s a program that reads this file and reformats the individual records. It prints the results to standard output. This approach to reading CSV files isn’t very good. In the next chapter, we’ll look at the csv module that handles some of the additional details required for a really reliable program.
    #!/usr/bin/env python
    """Read the name_addr.csv file."""
    from __future__ import print_function

    dataSource = open( "name_addr.csv", "r" )
    for addr in dataSource:
        # drop the trailing newline, then split the string on the ,'s
        quotes= addr.rstrip().split(",")
        # strip the '"'s from each field
        fields= [ f.strip('"') for f in quotes ]
        # print the first four fields of the record
        print( fields[0], fields[1], fields[2], fields[3] )
    dataSource.close()
Seeing Output with print. The print() function does two things. When we introduced print() back in Seeing Results : The print Statement, we hustled past both of these things because they were really quite advanced concepts.
The print() function has one more feature which can be very helpful to us. We can provide a file parameter to redirect the output to a particular file.
We can use this to write lines to sys.stderr.
    from __future__ import print_function
    import sys
    print("normal output")
    print("Red Alert!", file=sys.stderr)
    print("still normal output", file=sys.stdout)
When you run this in IDLE, you’ll notice that the error messages display in red, while the standard output displays in blue.
Print Command. Here is the syntax for an extension to the print statement.
print >> file [ , expression , ... ]
The >> is an essential part of this peculiar syntax. This is an odd special case punctuation that doesn’t appear elsewhere in the Python language. It’s called the “chevron print”.
This chevron print syntax will go away in Python 3. Instead of a print statement with a bunch of special cases, we’ll use the print() function.
Opening A File and Printing. This example shows how we open a file in the local directory and write data to that file. In this example, we’ll create an HTML file named addressbook.html. We’ll write some content to this file. We can then open this file with Firefox or Internet Explorer and see the resulting web page.
    #!/usr/bin/env python
    """Write the addressbook.html page."""
    from __future__ import print_function

    new_page = open( "addressbook.html", "w" )
    # every print() needs file=new_page, or the text goes to standard output
    print('<html>', file=new_page)
    print(' <head>'
        '<meta http-equiv="content-type" content="text/html; charset=us-ascii">'
        '<title>addressbook</title></head>', file=new_page)
    print(' <body><p>Hello world</p></body>', file=new_page )
    print('</html>', file=new_page)
    new_page.close()
Some disk devices are organized into cylinders and tracks instead of blocks. A disk may have a number of parallel platters; a cylinder is the stack of tracks across the platters available without moving the read-write head. A track is the data on one circular section of a single disk platter. What advantages does this have? What (if any) complexity could this lead to? How does an application program specify the tracks and sectors to be used?
Some disk devices are described as a simple sequence of blocks, in no particular order. Each block has a unique numeric identifier. What advantages could this have?
Some disk devices can be partitioned. What (if any) relevance does this have to file processing?
Skip The Header Record.
Our name_addr.csv file has a header record. We can skip this record by getting the iterator and advancing to the next item.
Write a variation on name_addr.py which uses the iter() function to get the iterator for the dataSource file. Assign this iterator object to dataSrcIter. If you replace the file, dataSource, with the iterator, dataSrcIter, how does the processing change? What is the value returned by dataSrcIter.next() before the for statement? How does adding this change the processing of the for statement?
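As a starting point, here is the general technique on a different, hypothetical two-column file. This is only a sketch: it uses the built-in next() function (available since Python 2.6) in place of the .next() method named in the exercise, and the variable names are illustrative.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "short_list.csv")

# A hypothetical file with a header record and two data records.
target = open(path, "w")
target.write('"First","Last"\n"Moe","Howard"\n"Larry","Fine"\n')
target.close()

source = open(path, "r")
rows = iter(source)   # ask the file object for its iterator
header = next(rows)   # consume the first line before the loop starts
body = [line.rstrip() for line in rows]  # the loop sees only what is left
source.close()

os.remove(path)
os.rmdir(work_dir)
```

An iterator remembers how far it has advanced, so anything consumed before the loop never reaches the loop.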
Combine The Two Examples.
Our two examples, addrpage.py and name_addr.py are really two halves of a single program. One program reads the names and addresses, the other program writes an HTML file. We can combine these two programs to reformat a CSV source file into a resulting HTML page.
The names and addresses could be formatted in a web page that looks like the following:
    <html>
    <head><title>Address Book</title></head>
    <body>
    <table>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    ...
    </table>
    </body>
    </html>
Each of our input fields becomes an output field sandwiched in between <td> and </td>. In this case, we use phrases like last name, first name and email address to show where real data would be inserted. The other HTML elements like <table> have to be printed as they’re shown in this example.
Your final program should open two files: name_addr.csv and addressbook.html. Your program should write the initial HTML material (up to the first <tr>) to the output file. It should then read the CSV records, writing a complete address line between <tr> and </tr>. After it finishes reading and writing names and addresses, it has to write the last of the HTML file, from </table> to </html>.
It’s a matter of running out of suitable synonyms. The operating system maintains files as collections of bytes on a disk or USB drive. Python gives us access to those OS files using a Python object called a file. It might have been nicer to call it a file handle, file channel, file socket or file descriptor. But the extra word would eventually get dropped, and we’d be back to Python files giving us access to OS files.
When we look under the hood, it actually gets more complex. Python’s file object is built around the C language FILE, defined in the stdio library, which uses the OS file descriptors which give access to the data on the disk (known as a file). Whew!
Yes, it’s rather complex. But it’s also very, very important because all of your data will be in OS files, and you’ll want to access those OS files using Python file objects.
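The two layers can be seen side by side. The sketch below, on a hypothetical scratch file, first talks to the OS file directly through the low-level functions in the os module, and then reads the same data back through an ordinary Python file object.

```python
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, "layers.txt")

# Low level: ask the OS for a file descriptor, which is a small integer.
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
os.write(fd, b"written through the descriptor\n")
os.close(fd)

# High level: a Python file object wraps a descriptor of its own.
source = open(path, "r")
wrapped_fd = source.fileno()
text = source.read()
source.close()

os.remove(path)
os.rmdir(work_dir)
```

Most programs never need the low-level calls; the Python file object handles the descriptor for us.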
Actually, direct access to devices is a pretty ugly and complex problem. Without the unifying abstraction of “file”, it would be nearly impossible to get useful data processing accomplished.
The differences between IDE, SATA, SCSI and USB drives are enough to make someone crazy, and they’re all – basically – disks. Each one has unique subtleties to how the device is identified, how requests are sent to it, and how the data comes back from the device. Throw in CDs and DVDs and you’ve got even more complex rules for handling the various kinds of “mass storage” media that are connected to your computer.
When you start to look at network interfaces (wireless WiFi, Bluetooth, and Ethernet) you’ll see more differences than similarities. Worse, your program would have to be customized to handle WiFi cards made by different manufacturers. For example, NetGear and LinkSys differences would be part of your program, not part of the operating system.
This is so important, we’ll return to it in File-Related Library Modules.