External Data and Files

All programs must deal with external data. They will either accept data from sources outside the text of the program, or they will produce some kind of output, or they will do both. Think about it: if the program produces no output, how do you know it did anything?

By external data, we mean data outside of volatile, high-speed, primary memory; we mean data on peripheral devices. This may be persistent data on a disk, or transient data on a network interface. For now, it may mean transient data displayed on our terminal.

Most operating systems provide simple, uniform access to external data via the abstraction called a file. We’ll look at the operating system implementation, as well as the Python class that gives us access to the operating system file in our programs.

In File Objects – Our Connection To The File System, we provide definitions of how Python works with files. We cover the built-in functions for working with files in The File and Open Functions. In Methods We Use on File Objects, we describe some method functions of file objects. We’ll look at file-processing statements in File Statements: Reading and Writing (but no Arithmetic).

File Objects – Our Connection To The File System

Abstractions Built on Top of Abstractions. Files do a huge number of things for us. To support this broad spectrum of capabilities, there are two layers of abstraction involved: the OS and Python. Unfortunately, both layers use the same words, so we have to be careful about casually misusing the word “file”.

The operating system has devices of various kinds. All of the various devices are unified using a common abstraction that we call the file system. All of a computer’s devices appear as OS files of one kind or another. Some things which aren’t physical devices also appear as files. Files are the plumbing that moves data around our information infrastructure.

Additionally, Python defines file objects. These file objects are the fixtures that give our Python program access to OS files.

The following figure shows this technology stack. Your program makes use of Python File objects. Python, in turn, makes use of OS file objects. Yes, it can be confusing that “file” is used for both things. However, you only have to focus on the Python file; the rest is just infrastructure to support your needs.

Figure: Python File and OS File

How Files Work. When your program evaluates a method function of a Python file object, Python transforms this into an operation on the underlying OS file. An OS file operation becomes an operation on one of the various kinds of devices attached to our computer. Or, an OS file operation can become a network operation that reaches through the Internet to access data from remote computers. The two layers of abstraction mean that one Python program can do a wide variety of things on a wide variety of devices.

Python File Objects

In Python, we create a file object to work with files in the file system. In addition to files in the OS’s file system, Python recognizes a spectrum of file-like objects, including abstractions for network interfaces called pipes and sockets and even some kinds of in-memory buffers.

Unlike sequences, sets and mappings, there are no Python literals for file objects. Lacking literals, we create a file object using the file() or open() factory function. We provide two pieces of information to this function. We can provide a third, optional, piece of information that may improve the performance of our program.

  • The name of the file. The operating system will interpret this name using its “working directory” rules. If the name starts with / (or device:\) it’s an absolute name. Otherwise, it’s a relative name; the current working directory plus this name identifies the file.

    Python can translate standard paths (using /) to Windows-specific paths. This saves us from having to really understand the differences. We can name all of our files using /, and avoid the messy details.

    We can, if we want, use raw strings to specify Windows path names using the \ character.

  • The access mode for the file. This is some combination of read, write and append. The mode can also include instructions for interpreting the bytes as characters.

  • Optionally, we can include the buffering for the file. Generally, we omit this. If the buffering argument is given, 0 means each byte is transferred as it is read or written. A value of 1 means the data is buffered a line at a time, suitable for reading from a console, or writing to an error log. Larger numbers specify the buffer size: numbers over 4,096 may speed up your program.

Once we create the file object, we can do operations to read characters from the file or write characters to the file. We can read individual characters or whole lines. Similarly, we can write individual characters or whole lines.

When Python reads a file as a sequence of lines, each line will become a separate string. The '\n' character is preserved at the end of the string. This extra character can be removed from the string using the rstrip() method function.

A file object (like a sequence) can create an iterator which will yield the individual lines of the file. You can, consequently, use the file object in a for statement. This makes reading text files very simple.

When the work is finished, we also need to use the file’s close() method. This empties the in-memory buffers and releases the connection with the operating system file. In the case of a socket connection, this will release all of the resources used to assure that data travels through the Internet successfully.
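
The life cycle described above (create the file object, read or write, then close) can be sketched as follows. This is only an illustration under stated assumptions: the scratch file sample.txt and the variable names are made up for the example, and it uses open() rather than file() because file() exists only in Python 2.

```python
import os
import tempfile

# Hypothetical scratch file, created only so the example is self-contained.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "sample.txt")

# Create the file object in write mode, write two lines, then close it.
target = open(path, "w")
target.write("first line\n")
target.write("second line\n")
target.close()

# Re-open in read mode; the for statement uses the file's iterator.
source = open(path, "r")
lines = []
for line in source:
    # Each line keeps its trailing '\n'; rstrip() removes it.
    lines.append(line.rstrip())
source.close()
```

After this runs, lines holds the two strings without their newline characters, and both file objects have been closed.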

The File and Open Functions

Here’s the formal definition of the file() and open() factory functions. These functions create Python file objects and connect them to the appropriate operating system resources.

open(filename, mode[, buffering]) → file

The filename is the name of the file. This is simply given to the operating system. The OS expects either absolute or relative paths; the operating system folds the current working directory into relative paths.

The mode is covered in detail below. It can be 'r', 'w' or 'a' for reading (default), writing or appending. If the file doesn’t exist when opened for writing or appending, it will be created. If a file existed when opened for writing, it will be truncated and overwritten. Add a 'b' to the mode for binary files. Add a '+' to the mode to allow simultaneous reading and writing.

If the buffering argument is given, 0 means unbuffered, 1 means line buffered, and larger numbers specify the buffer size.

file(filename, mode[, buffering]) → file

This is another name for the open() function. It parallels other factory functions like int() and dict().

Python expects the POSIX standard punctuation of / to separate elements of the filename path for all operating systems. If necessary, Python will translate these standard name strings to the Windows punctuation of \. Using standardized punctuation makes your program portable to all operating systems. The os.path module has functions for creating valid names in a way that works on all operating systems.

Tip

Constructing File Names

When using Windows-specific punctuation for filenames, you’ll have problems because Python interprets the \ as an escape character. To create a string with a Windows filename, you’ll either need to use \\ in the string, or use an r" " raw string literal. For example, you can use either of the following: r"E:\writing\technical\pythonbook\python.html" or "E:\\writing\\technical\\pythonbook\\python.html".

Note that you can often use "E:/writing/technical/pythonbook/python.html". This uses the POSIX standard punctuation for file paths, /, and is the most portable. Python generally translates standard file names to Windows file names for you.

Generally, you should either use standard names (using /) or use the os.path module to construct filenames. This module eliminates the need to use any specific punctuation. The os.path.join() function makes properly punctuated filenames from sequences of strings.
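
Here is a small sketch of building a name with os.path.join(). The directory and file names are made up for illustration; the point is only that we never type a separator ourselves.

```python
import os

# The directory and file names here are hypothetical.
page_name = os.path.join("writing", "technical", "python.html")

# os.path.join() inserts the separator this operating system expects;
# os.sep holds that separator ("/" on POSIX systems, "\\" on Windows).
expected = os.sep.join(["writing", "technical", "python.html"])

# os.path.split() separates the final element from the directory part.
directory, base = os.path.split(page_name)
```

Because the separator comes from the os module, the same source text produces a valid name on each operating system.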

The Mode String. The mode string specifies how the OS file will be accessed by your program. There are four separate issues addressed by the mode string: opening, bytes, newlines and operations.

  • Opening. For the opening part of the mode string, there are three alternatives:

    r: Open for reading. Start at the beginning of the OS file. If the OS file does not exist, raise an IOError exception. This is the default.
    w: Open for writing. Start at the beginning of the OS file. If the OS file does not exist, create the OS file; if it does exist, truncate it.
    a: Open for appending. Start at the end of the OS file. If the OS file does not exist, create the OS file.

  • Bytes or Characters. For the byte handling part of the mode string, there are two alternatives:

    b: The OS file is a sequence of bytes; do not interpret the file as a sequence of characters. This is suitable for .csv files as well as images, movies, sound samples, etc.

    The default, if b is not included, is to interpret the file as a sequence of ordinary characters. The Python file object will be an iterator that yields each individual line from the OS file as a separate string. Translations from various encoding schemes like UTF-8 and UTF-16 will be handled automatically.

  • Universal Newlines. The newline part of the mode string has two alternatives:

    U: Universal newline interpretation. The first instance of \n, \r\n (or \r) will define the newline character(s). Any of these three newline sequences will be silently translated to the standard '\n' character. The \r\n is a Windows feature.

    The default, if U is not included, is to only handle this operating system’s standard newline character(s).

  • Mixed Operations. For the additional operations part of the mode string, there are two alternatives:

    +: Allow both read and write operations to the OS file.

    The default, if + is not included, is to allow only limited operations: only reads for files opened with “r”; only writes for OS files opened with “w” or “a”.

Typical combinations include the following:

  • "r" to read text files.
  • "rb" to read binary files. A .csv file, for example, is often processed in binary mode.
  • "w+" to create a new text file for reading and writing.
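
The difference between the three opening modes can be seen in a short sketch. The scratch directory and the file names modes.txt and missing.txt are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical scratch location so the example does not touch real files.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "modes.txt")

# "w" creates the file, or truncates it if it already exists.
f = open(path, "w")
f.write("old contents\n")
f.close()
f = open(path, "w")
f.write("one\n")
f.close()

# "a" starts at the end, so the existing data is kept.
f = open(path, "a")
f.write("two\n")
f.close()

# "r" reads from the beginning.
f = open(path, "r")
text = f.read()
f.close()

# "r" on a file that does not exist raises an exception
# instead of creating the file.
try:
    open(os.path.join(scratch_dir, "missing.txt"), "r")
    missing_failed = False
except IOError:
    missing_failed = True
```

The second "w" open discards "old contents", which is why text ends up holding only the lines written afterwards.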

The following examples create Python file objects for further processing:

dataSource= open( "name_addr.csv", "rb" )
newPage= open( "addressbook.html", "w" )
theErrors= open( "/usr/local/log/error.log", "a" )

dataSource:

This example opens the existing file name_addr.csv in the current working directory for reading. The variable dataSource identifies this file object, and we can use this variable for reading strings from this file.

This file is opened in binary mode.

newPage:

This example creates a new file addressbook.html (or it will truncate this file if it exists). The file will be in the current working directory. The variable newPage identifies the file object. We can then use this variable to write strings to the file.

theErrors:

This example appends to the file error.log (or creates a new file, if the file doesn’t exist). The file has the directory path /usr/local/log/. Since this is an absolute name, it doesn’t depend on the current working directory.

Buffering files is typically left as a default, specifying nothing. However, for some situations, adjusting the buffering can improve performance. Error logs, for instance, are often unbuffered, so the data is available immediately. Large input files may be opened with large buffer numbers to encourage the operating system to optimize input operations by reading a few large chunks of data from the device instead of a large number of smaller chunks.
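
As a rough sketch of the buffering argument, the following opens a log with line buffering and then reads it back with a larger buffer. The file name and the buffer size of 65536 are arbitrary choices for illustration; whether a larger buffer actually helps depends on the file and the system.

```python
import os
import tempfile

# Hypothetical scratch location for the example log.
scratch_dir = tempfile.mkdtemp()
log_path = os.path.join(scratch_dir, "error.log")

# A buffering value of 1 asks for line buffering: each completed line
# is passed along without waiting for a full buffer.
log = open(log_path, "a", 1)
log.write("something went wrong\n")
log.close()

# A larger number is a buffer size in bytes. 65536 is an arbitrary
# value chosen only for this sketch.
source = open(log_path, "r", 65536)
contents = source.read()
source.close()
```

The data read back is the same either way; buffering changes when and in what size chunks the data moves, not what the data is.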

Tip

Debugging Files

There are a number of things that can go wrong in attempting to create a file object.

If the file name is invalid, you will get operating system errors. Usually they will look like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'wakawaka'

It is very important to get the file’s path completely correct. You’ll notice that each time you start IDLE, it thinks the current working directory is something like C:\Python26. You’re probably doing your work in a different default directory.

When you open a module file in IDLE, you’ll notice that IDLE changes the current working directory to the directory that contains your module. If you have your .py files and your data files all in one directory, you’ll find that things work out well.

The next most common error is to have the wrong permissions. This usually means trying to write to a file you don’t own, or attempting to create a file in a directory where you don’t have write permission. If you are using a server, or a computer owned by a corporation, this may require some work with your system administrators to sort out what you want to do and how you can accomplish it without compromising security.

The [Errno 2] note in the error message is a reference to the internal operating system error numbers. There are over 100 of these error numbers, all collected into the module named errno. There are a lot of different things that can go wrong, many of which are very, very obscure situations.
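
A program can examine the error number instead of just letting the traceback appear. The following is a minimal sketch: the name wakawaka is the made-up name from the traceback above, and the messages assigned to problem are our own wording, not anything Python produces.

```python
import errno
import os
import tempfile

# Placing the made-up name in a fresh scratch directory guarantees
# that no such file exists.
missing = os.path.join(tempfile.mkdtemp(), "wakawaka")

try:
    source = open(missing, "r")
    source.close()
    problem = None
    symbol = None
except IOError as e:
    # e.errno is the operating system's error number;
    # errno.errorcode maps the number to its symbolic name.
    symbol = errno.errorcode.get(e.errno)
    if e.errno == errno.ENOENT:
        problem = "no such file or directory"
    elif e.errno == errno.EACCES:
        problem = "permission denied"
    else:
        problem = "something else: %s" % e.strerror
```

Comparing against names like errno.ENOENT keeps the program readable and avoids depending on the literal number.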

Methods We Use on File Objects

The Python file object is our view of the underlying operating system file. The OS file, in turn, gives us access to a specific device.

The Python file object has a number of operations that transform the file object, read from or write to the OS file, or access information about the file object.

Reading. The following read methods get data from the OS file. These operations may also change the Python file object’s internal status and buffers. For example, at end-of-file, the internal status of the file object will be changed. Most importantly, these methods have the very visible effect of consuming data from the OS file.

file.read(size) → string

Read as many as size characters from file f as a single, large string. If size is negative or omitted, the rest of the file is read into a single string.

from __future__ import print_function
dataSource= open( "name_addr.csv", "r" )
theData= dataSource.read()
for n in theData.splitlines():
    print(n)
dataSource.close()

file.readline(size) → string

Read the next line or as many as size characters from file f; an incomplete line can be read. If size is negative or omitted, the next complete line is read. If a complete line is read, it includes the trailing newline character. If the file is at the end, f.readline() returns a zero length string. If the file has a blank line, this will be a string of length 1, just the newline character.

from __future__ import print_function
dataSource= file( "name_addr.csv", "r" )
n= dataSource.readline()
while len(n) > 0:
    print(n.rstrip())
    n= dataSource.readline()
dataSource.close()

file.readlines(hint) → list of strings

Read the remaining lines, or as many lines as fit in roughly the next hint characters, from file f. The hint size may be rounded up to match an internal buffer size. If hint is negative or omitted, the rest of the file is read. All lines will include the trailing newline character. If the file is at the end, f.readlines() returns a zero length list.

When we simply reference a file object in a for statement, we get the same lines, but one at a time through the file’s iterator rather than as a single list.

dataSource= file( "name_addr.csv", "r" )
for n in dataSource:
    print(n.rstrip())
dataSource.close()
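
For comparison, here is a sketch that calls readlines() directly. The three-line scratch file is invented for the example so that the results are easy to follow.

```python
import os
import tempfile

# Build a small, hypothetical three-line file to read from.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "three_lines.txt")
target = open(path, "w")
target.write("alpha\nbeta\ngamma\n")
target.close()

source = open(path, "r")
first = source.readline()      # consumes one line
rest = source.readlines()      # the remaining lines, as a list
at_end = source.readlines()    # nothing left: an empty list
source.close()
```

Note that reading consumes data: readlines() only returns the lines that readline() has not already taken.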

Writing. The following methods send data to the OS file. These operations may also change the Python file object’s internal status and buffers. Most importantly, these methods have the very visible effect of producing data to the OS file.

file.flush()

Flush all accumulated data from the internal buffers of file f to the device or interface. If a file is buffered, this can help to force writing of a buffer that is less than completely full. This is appropriate for log files, prompts written to sys.stdout and error messages.
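
A small sketch of flush() at work: after the flush, a second file object can see the data even though the first is still open. The file name progress.log is made up for this example.

```python
import os
import tempfile

# Hypothetical scratch log file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "progress.log")

log = open(path, "w")
log.write("step 1 complete\n")
# Until the buffer is flushed, the text may still be sitting in memory.
log.flush()

# A second, independent file object now sees the flushed data,
# even though the first one has not been closed.
reader = open(path, "r")
seen = reader.read()
reader.close()
log.close()
```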

file.truncate(size)

Truncate file f. If size is not given, the file is truncated at the current position. If size is given, the file will be truncated at or before size. This function is not available on all platforms.
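
The following sketch truncates a file to a given size and reads back what is left. It uses binary mode so that sizes are plain byte counts; the file name and contents are invented for the example.

```python
import os
import tempfile

# Hypothetical scratch file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "data.bin")

# "w+b" allows both writing and reading, and treats the data as bytes.
f = open(path, "w+b")
f.write(b"abcdefghij")
f.flush()

# Keep only the first four bytes.
f.truncate(4)

# Rewind and read back what is left.
f.seek(0)
remaining = f.read()
f.close()
size_on_disk = os.path.getsize(path)
```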

file.write(string)

Write the given string to file f. Buffering may mean that the string does not appear on a console until a close() or flush() operation is used.

newPage= file( "addressbook.html", "w" )
newPage.write( "<html>\n<head><title>Hello World</title></head>\n<body>\n" )
newPage.write( "<p>Hello World</p>\n" )
newPage.write( "</body>\n</html>\n" )
newPage.close()

file.writelines(list)

Write the list of strings to file f. Buffering may mean that the strings do not appear on any console until a close() or flush() operation is used.

newPage= file( "addressbook.html", "w" )
newPage.writelines( [ "<html>\n", "<head><title>Hello World</title></head>\n", "<body>\n" ] )
newPage.writelines( ["<p>Hello World</p>\n" ] )
newPage.writelines( [ "</body>\n", "</html>\n" ] )
newPage.close()

Accessors. The following file accessors provide information about the file object.

file.tell() → integer

Return the position from which file f will be processed. This is a partner to the seek() method; any position returned by the tell() method can be used as an argument to the seek() method to restore the file to that position.

file.fileno() → integer

Return the internal file descriptor (fd) number used by the OS library when working with file f. A number of modules provide access to these low-level libraries for advanced operations on devices and files.

file.isatty() → boolean

Return True if file f is connected to an OS file that is a console or keyboard.
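
As an illustration of these two accessors, the sketch below hands the descriptor number to os.fstat(), one of the low-level functions that accept a descriptor. The file name is hypothetical, and binary mode is used so that the size is an exact byte count.

```python
import os
import tempfile

# Hypothetical scratch file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "plain.bin")

f = open(path, "wb")
f.write(b"hello\n")
f.flush()

# fileno() is the small integer the OS uses to identify this open file.
fd = f.fileno()
# os.fstat() reports on the OS file behind that descriptor.
size = os.fstat(fd).st_size
# A file on disk is not a console.
on_console = f.isatty()
f.close()
```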

file.closed → boolean

This attribute of file f is True if the file is closed.

file.mode → string

This attribute is the mode argument to the file() function that was used to create the file object.

file.name → string

This attribute of file f is the filename argument to the file() function that was used to create the file object.
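
A short sketch of the three attributes. Note that they are attributes, not method functions, so they are referenced without parentheses. The file name is invented for the example.

```python
import os
import tempfile

# Hypothetical scratch file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "attrs.txt")

f = open(path, "w")
# Attributes are referenced without parentheses.
name_before = f.name
mode_before = f.mode
closed_before = f.closed
f.close()
closed_after = f.closed
```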

Transformers. The following file transforms change the file object itself. This includes closing it (and releasing all OS resources) or changing the position at which reading or writing happens.

file.close()

Close file f. The closed flag is set. Any further operations (except a redundant close) raise a ValueError exception.
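
The following sketch shows what happens when we use a file object after closing it. In Python versions we are aware of, the exception raised is ValueError; the file name and the outcome variable are made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "closed.txt")

f = open(path, "w")
f.write("done\n")
f.close()
f.close()   # a redundant close is harmless

try:
    f.write("too late\n")
    outcome = "write succeeded"
except ValueError:
    # Raised for an I/O operation on a closed file.
    outcome = "ValueError"
```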

file.seek(offset[, whence])

Change the position from which file f will be processed. There are three values for whence, which determine the reference point for the move.

If whence is 0 (the default), move to the absolute position given by offset. f.seek(0) will rewind file f.

If whence is 1, move relative to the current position by offset bytes. If offset is negative, move backwards; otherwise move forward.

If whence is 2, move relative to the end of file. f.seek(0,2) will advance file f to the end.
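
The three whence values, along with tell(), can be tried out on a small file. This is a sketch under stated assumptions: the ten-byte file is invented for the example, and binary mode is used so that positions are simple byte offsets.

```python
import os
import tempfile

# Hypothetical ten-byte scratch file.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "positions.bin")
f = open(path, "w+b")
f.write(b"0123456789")

end_position = f.tell()      # position after writing ten bytes

f.seek(0)                    # whence 0: rewind to the start
first_three = f.read(3)
saved = f.tell()             # remember where we are

f.seek(2, 1)                 # whence 1: skip two bytes forward
next_two = f.read(2)

f.seek(-1, 2)                # whence 2: one byte before the end
last_one = f.read()

f.seek(saved)                # restore the remembered position
after_restore = f.read(1)
f.close()
```

The value saved by tell() is what makes it possible to return to the same place later with seek().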

File Statements: Reading and Writing (but no Arithmetic)

A file object (like a sequence) can create an iterator which will yield the individual lines of the file. We looked at how sequences work with the for statement in Looping Back : Iterators, the for statement and Generators. Here, we’ll use the file object in a for statement to read all of the lines.

Additionally, the print statement can make use of a file other than standard output as a destination for the printed characters. This will change with Python 3.0, so we won’t emphasize this.

Opening and Reading From a File. Let’s say we have the following file. If you use an email service like HotMail, Yahoo! or Google, you can download an address book in Comma-Separated Values (CSV) format that will look similar to this file. Yahoo!’s format will have many more columns than this example.

name_addr.csv

"First","Middle","Last","Nickname","Email","Category"
"Moe","","Howard","Moe","moe@3stooges.com","actor"
"Jerome","Lester","Howard","Curly","curly@3stooges.com","actor"
"Larry","","Fine","Larry","larry@3stooges.com","musician"
"Jerome","","Besser","Joe","joe@3stooges.com","actor"
"Joe","","DeRita","CurlyJoe","curlyjoe@3stooges.com","actor"
"Shemp","","Howard","Shemp","shemp@3stooges.com","actor"

Here’s a quick example that shows one way to read this file using the file’s iterator. This isn’t the best way; that will have to wait for The csv Module.

dataSource = file( "name_addr.csv", "r" )
for addr in dataSource:
    print(addr)
dataSource.close()

  1. We create a Python file object for the name_addr.csv file in the current working directory in read mode. We call this object dataSource.
  2. The for statement creates an iterator for this file; the iterator will yield each individual line from the file.
  3. We can print each line.
  4. We close the file when we’re done. This releases any operating system resources that our program tied up while it was running.

A More Complete Reader. Here’s a program that reads this file and reformats the individual records. It prints the results to standard output. This approach to reading CSV files isn’t very good. In the next chapter, we’ll look at the csv module that handles some of the additional details required for a really reliable program.

nameaddr.py

#!/usr/bin/env python
"""Read the name_addr.csv file."""
# Without this import, Python 2 would print a tuple below.
from __future__ import print_function
dataSource = file( "name_addr.csv", "r" )
for addr in dataSource:
    # split the string on the ,'s
    quotes= addr.split(",")
    # strip the '"'s from each field
    fields= [ f.strip('"') for f in quotes ]
    # first, middle and last name, then the email address
    print( fields[0], fields[1], fields[2], fields[4] )
dataSource.close()

  1. We open the file name_addr.csv in our current working directory. The variable dataSource is our Python file object.
  2. The for statement gets an iterator from the file. It can then use the iterator, which yields the individual lines of the file. Each line is a long string. The fields are surrounded by "s and are separated by ,s.
  3. We use the split() function to break the string up using the ,s. This particular process won’t work if there are ,s inside the quoted fields. We’ll look at the csv module to see how to do this better.
  4. We use the strip() function to remove the "s from each field. Notice that we used a list comprehension to map from a list of fields wrapped in "s to a list of fields that are not wrapped in "s.
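
To see why splitting on commas is fragile, consider a record whose first field contains a comma inside the quotes. The record below is made up for this illustration; it is not part of name_addr.csv.

```python
# A hypothetical record with a comma inside the first quoted field.
line = '"Howard, Moe","moe@3stooges.com"\n'

# The same split-and-strip processing used in nameaddr.py.
quotes = line.rstrip().split(",")
fields = [f.strip('"') for f in quotes]

# The record has two fields, but the naive split finds three pieces.
piece_count = len(fields)
```

The name is broken into two pieces, and every later field is shifted over by one position.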

Seeing Output with print. The print() function does three things. When we introduced print() back in Seeing Results : The print Statement, we hustled past these things because they were really quite advanced concepts.

We covered strings in Sequences of Characters : str and Unicode. We’re covering files in this chapter. Now we can open up the hood and look closely at the print() function.

  1. The print() function evaluates all of its expressions and converts them to strings. In effect, it calls the str() built-in function for each argument value.
  2. The print() function writes these strings, separated by a separator character, sep. The default separator is a space, ' '.
  3. The print() function also writes an end character, end. The default end is the newline character, '\n'.
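
The three steps above can be observed by printing into a file and reading the result back. This sketch assumes the print() function is available, either in Python 3 or through from __future__ import print_function at the top of a Python 2 module; the file name is made up for the example.

```python
import os
import tempfile

# Assumes print() is a function (Python 3, or print_function enabled).
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "printed.txt")

out = open(path, "w")
# Non-string arguments are converted as if by str().
print("Moe", 3, 2.5, file=out)
# sep replaces the default space; end replaces the default newline.
print("Moe", "Howard", sep=",", end=";\n", file=out)
out.close()

check = open(path, "r")
printed = check.read()
check.close()
```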

The print() function has one more feature which can be very helpful to us. We can provide a file parameter to redirect the output to a particular file.

We can use this to write lines to sys.stderr.

from __future__ import print_function
import sys
print("normal output")
print("Red Alert!", file=sys.stderr)
print("still normal output", file=sys.stdout)

  1. We enable the print() function.
  2. We import the sys module.
  3. We write a message to standard output using the undecorated print() function.
  4. We use the file parameter to write to sys.stderr.
  5. We also use the file parameter to write to sys.stdout.

When you run this in IDLE, you’ll notice that the error messages display in red, while the standard output displays in blue.

Print Command. Here is the syntax for an extension to the print statement.

print  >> file [ ,  expression , ... ]

The >> is an essential part of this peculiar syntax. This is an odd special case punctuation that doesn’t appear elsewhere in the Python language. It’s called the “chevron print”.

Important

Python 3

This chevron print syntax will go away in Python 3. Instead of a print statement with a bunch of special cases, we’ll use the print() function.

Opening A File and Printing. This example shows how we open a file in the local directory and write data to that file. In this example, we’ll create an HTML file named addressbook.html. We’ll write some content to this file. We can then open this file with Firefox or Internet Explorer and see the resulting web page.

addrpage.py

#!/usr/bin/env python
"""Write the addressbook.html page."""
from __future__ import print_function
new_page = open( "addressbook.html", "w" )
print('<html>', file=new_page)
print(' <head>'
    '<meta http-equiv="content-type" content="text/html; charset=us-ascii">'
    '<title>addressbook</title></head>', file=new_page)
print(' <body><p>Hello world</p></body>', file=new_page )
print('</html>', file=new_page)
new_page.close()

Basic File Exercises

  1. Device Structures.

    Some disk devices are organized into cylinders and tracks instead of blocks. A disk may have a number of parallel platters; a cylinder is the stack of tracks across the platters available without moving the read-write head. A track is the data on one circular section of a single disk platter. What advantages does this have? What (if any) complexity could this lead to? How does an application program specify the tracks and sectors to be used?

    Some disk devices are described as a simple sequence of blocks, in no particular order. Each block has a unique numeric identifier. What advantages could this have?

    Some disk devices can be partitioned. What (if any) relevance does this have to file processing?

  2. Skip The Header Record.

    Our name_addr.csv file has a header record. We can skip this record by getting the iterator and advancing to the next item.

    Write a variation on nameaddr.py which uses the iter() function to get the iterator for the dataSource file. Assign this iterator object to dataSrcIter. If you replace the file, dataSource, with the iterator, dataSrcIter, how does the processing change? What is the value returned by dataSrcIter.next() before the for statement? How does adding this change the processing of the for statement?

  3. Combine The Two Examples.

    Our two examples, addrpage.py and nameaddr.py, are really two halves of a single program. One program reads the names and addresses, the other program writes an HTML file. We can combine these two programs to reformat a CSV source file into a resulting HTML page.

    The names and addresses could be formatted in a web page that looks like the following:

    <html>
    <head><title>Address Book</title></head>
    <body>
    <table>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    <tr><td>last name</td><td>first name</td><td>email address</td></tr>
    ...
    </table>
    </body>
    </html>

    Each of our input fields becomes an output field sandwiched in between <td> and </td>. In this case, we use phrases like last name, first name and email address to show where real data would be inserted. The other HTML elements like <table> have to be printed as they’re shown in this example.

    Your final program should open two files: name_addr.csv and addressbook.html. Your program should write the initial HTML material (up to the first <tr>) to the output file. It should then read the CSV records, writing a complete address line between <tr> and </tr>. After it finishes reading and writing names and addresses, it has to write the last of the HTML file, from </table> to </html>.

File FAQ’s

Why are there two meanings for “file”?

It’s a matter of running out of suitable synonyms. The operating system maintains files as collections of bytes on a disk or USB drive. Python gives us access to those OS files using a Python object called a file. It might have been nicer to call it a file handle, file channel, file socket or file descriptor. But the extra word would eventually get dropped, and we’d be back to Python files giving us access to OS files.

When we look under the hood, it actually gets more complex. Python’s file object is built around the C language FILE, defined in the stdio library, which uses the OS file descriptors which give access to the data on the disk (known as a file). Whew!

Yes, it’s rather complex. But it’s also very, very important because all of your data will be in OS files, and you’ll want to access those OS files using Python file objects.

Why do they have files? What’s wrong with accessing devices directly? Wouldn’t it be simpler?

Actually, direct access to devices is a pretty ugly and complex problem. Without the unifying abstraction of “file”, it would be nearly impossible to get useful data processing accomplished.

The differences between IDE, SATA, SCSI and USB drives are enough to make someone crazy, and they’re all, basically, disks. Each one has unique subtleties in how the device is identified, how requests are sent to it, and how the data comes back from the device. Throw in CDs and DVDs and you’ve got even more complex rules for handling the various kinds of “mass storage” media that are connected to your computer.

When you start to look at network interfaces (wireless WiFi, Bluetooth, and Ethernet) you’ll see more differences than similarities. Worse, your program would have to be customized to handle WiFi cards made by different manufacturers. For example, NetGear and LinkSys differences would be part of your program, not part of the operating system.

This is so important, we’ll return to it in File-Related Library Modules.
