Python offers a sophisticated family tree of collections: the sequence collections include strings, tuples and lists; the set collections include set and frozenset; the mapping collection is the dictionary. We may want to invent our own kinds of collections to add new features to one of these generic collections, or do things that are unique to our collection of data items.
We’ll talk about collections in general in Collections: The Superclass. The special methods to create a new kind of sequence will be covered in How Do I Make An Object Behave Like A Sequence?. This chapter will present the special method names that we can use to create new kinds of mapping in Making It Behave Like A Mapping.
This chapter is optional material; it is at (or maybe just beyond) the limit of what can be considered appropriate for newbies. It’s included because it can give you a much deeper understanding of how Python works. This may help you to create programs that package sophisticated data or processing in simple-looking programming. It doesn’t cover any new types of data or new statements, so it can be skipped.
When designing a class that behaves like a collection, we’ll need to provide definitions for some essential operators that are common to all varieties of collections. Beyond the essential special method names, there are some method names that we can use to make our collection behave more like a list (indexed by simple integers), and a separate set of special method names that we can use to make our collection behave more like a mapping (indexed by other objects).
Here’s an example of a new kind of collection: we want a new sequence class that adds some statistical functions, but is otherwise list-like. This would allow us to collect sample data into a list-like object and ask that object for its mean or standard deviation. We started this in the exercises at the end of Defining New Objects.
By providing the following special methods, your new class can behave like an standard collection.
Handles len(); returns the number of items in this collection.
This is also used in a boolean context (if, while and bool()) if the class lacks a __nonzero__() method. If the __len__() method returns zero, the object is considered to be False.
Handles object[ key ]; finds item identified by key in this collection.
For sequence types, the accepted keys should be integers or slice objects. For mapping types, any object can be a key. The special interpretation of negative indexes (if the class wishes to emulate a sequence type) must be supported by the __getitem__() method.
If key is of an inappropriate type, TypeError may be raised. If a value outside the set of indexes for the sequence (after any special interpretation of negative values), IndexError should be raised. The for statement requires an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.
Handles object[ key ] = value; updates the collection so that object[ key ] is value.
See getitem() for restrictions on the key. This should only be implemented for mappings if the objects support changes to the values for keys, or if new keys can be added.
This is implemented for sequences if elements can be replaced. The same exceptions should be raised for improper key values as for the getitem() method.
Handles del object[ key ]; updates the collection to remove the item identified by key.
See getitem() for restrictions on the key. This should only be implemented for mappings if the objects support removal of keys, or for sequences if elements can be removed from the sequence. The same exceptions should be raised for improper key values as for the getitem() method.
Example. As a quick example, here’s the start of a class which implements the __len__() special method. We’ll define a class which keeps a list of values, and which computes statistics, skipping over any None items in the list.
class StatList( object ):
def __init__( self, seedValues=None ):
if seedValues:
self.theValues= seedValues
else:
self.theValues= []
def __len__( self ):
count = 0
for v in self.theValues:
if v is not None:
count += 1
return count
We can use this list structure in as follows.
>>> a= StatList( [1,2,3,None,5,6] )
>>> len(a)
5
When we asked for len() of our StatList object, a, Python used the __len__() special method that we provided.
A sequence is a collection of individual items; individual items are identified by a numeric index value. A tuple is a kind of sequence that supports relatively few methods, mostly just creation and the __getitem__() method. A list is a kind of sequence with a number of additional methods like __setitem__() that can change the items in the list.
Python doesn’t hold us to just simple integers as the index for our lists. Python allows us to create a slice of a list. When we say someList[2:8] we are picking out 6 items from the original list to make a new list. When we create our own sequences we should support this slicing operation, also.
Before looking at sequences, we’ll look at how Python describes a slice of a sequence. After looking at slices, we can look at the methods we need to implement to create our own kinds of sequences.
slice Objects. When we create a sequence class, our __getitem__(), __setitem__(), __delitem__() method functions must be prepared for the key to be one of the following three kinds of values:
A slice object is a bundle of attributes with few method functions. When we provide Python syntax like a[n:m], the index expression becomes the slice(n,m) object. Here are the important attributes of a Slice object.
| start: | The (optional) start number of the range. If omitted, this will be 0. |
|---|---|
| stop: | The stop number of the range. If omitted, a very large number will be used here, to be sure that the stop number is definitely larger than your sequence. Python rules tell us that this is the index after the last item we process. Remember that a[2:8] stops before index value 8, and processes index value 7. |
| step: | The (optional) step through the range. If omitted, this will None. If the stop number is positive, you will interpret this to be 1. |
Here are some examples of slice objects.
The expression someSequence[1:5] is transformed to someSequence.__getitem__( slice(1,5) ).
This slice object has the following attribute values: key.start = 1, key.stop = 5, key.step = None.
The expression someSequence[2:] is transformed to someSequence.__getitem__( slice(start=2,stop=2147483647) ).
This slice object has the following attribute values: key.start = 1, key.stop = 2147483647, key.step = None.
The expression someSequence[2:8:2] is transformed to someSequence.__getitem__( slice(2,8,2) ). The slice object is assigned to the key parameter has the following attribute values: key.start = 2, key.stop = 8, key.step = 2.
Sequence Method Functions. Sequence types should also provide the following method functions.
Update the sequence to sort the items into order.
If a key is provided, this function is used to pick the key value out of the object.
Sequence types should also implement concatenation (via the + operator) and repetition (via the * operator) by defining the following special names. These were described in How The Numeric Operations Work in a numeric context.
To support the item in self and item not in self operators, you will also need to define the following special method function.
Example of a list with extra features. Let’s look at a situation where we want to create a new kind of list which will only store numeric values. We would use this special-purpose list for collecting statistical data. Because this list has only numbers we can reliably compute averages and standard deviations of the items in the list.
The idea is to update methods like append() and extend() to enforce a numbers-only policy. In the event of an attempt to append a non-number (a string, for example), would result in raising an exception.
We can also add new methods, like sum(), mean(), and standard_deviation() to this new kind of list. This new kind of list, let’s call it NumberList, would let us write programs that look like the following.
>>> data= NumberList()
>>> data.append( 1 )
>>> data.append( 1 )
>>> data.append( 2 )
>>> data.append( 3 )
>>> print data.sum()
7
Making a Subclass of list The easiest way to create a new kind of sequence is to extend the class list. We can write method functions which work by first assuring that the data is valid, as well as using the built-in list methods.
Looking at the methods that put new data into a list, we’ll need to override the following method functions: __setitem__(), append(), insert(), extend(), __add__() and __iadd__(). We’ll write versions of these functions which assure that the values going into the list are numeric.
Generally, what we are doing is extending the existing methods of the list class. We can use the methods of list with super(NumberList,self). We can write our own version of __setitem__() that checks our constraints, but uses the superclass super(NumberList,self).__setitem__ to do the real work of changing the list object.
We’ll use the laziest possible check for a valid number: we’ll evaluate the int() function. If the data value is numeric (including a string of characters that looks like a number), it will be converted to an integer. If the data value is not numeric, an exception will be raised, and the program will stop. The program which uses our NumberList must be aware of the exception-raising, and use appropriate try statements.
class NumberList( list ):
"""A list of integers.
Throws exceptions if non-integer data is supplied"""
def __setitem__( self, index, number ):
"""l[a]= n will throw an exception if n is not a valid integer"""
super(NumberList,self).__setitem__( index, int( number ) )
def append( self, number ):
super(NumberList,self).append( int( number ) )
def insert( self, index, number ):
super(NumberList,self).insert( index, int( number ) )
def extend( self, iterable ):
for value in iterable:
self.append( int( value ) )
def __add__( self, numberList ):
return super(NumberList,self).__add__( map( int, numberList ) )
def __iadd__( self, numberList ):
super(NumberList,self).__iadd__( map( int, numberList ) )
Using NumberList. The following example shows how we created an instance of our new NumberList class and exercise that object, inserting numbers as well as strings. You’ll see that any attempt to put a string into the object will raise an exception.
>>> a = NumberList()
>>> a.append( 1 )
>>> a.extend( [ 1,2,3 ] )
>>> a
[1, 1, 2, 3]
>>> a.append( "what?" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "numberList.py", line 5, in append
list.append( self, int( value ) )
ValueError: invalid literal for int() with base 10: 'what?'
Once we’ve constrained our list to accept only integer values, we can add methods to this class that compute sums, averages, medians, modes, and standard deviations. Since only integer values can be inserted, we don’t have to concern ourselves with an unexpected string or other unprocessable value.
Extra Features - Sum. Here’s how we can implement a sum() method. This will iterate through the collection of values computing the total. We can use this to compute an average, also.
class NumberList( list ):
#
# Rest of the definitions from above
#
def sum( self ):
total= 0
for v in self:
total += v
return total
This works because the self variable is, itself, a list. This is the most important consequence of defining our new class to extend the built-in list class.
You can use the built-in sum() function, also. The definition would look like this.
class NumberList( list ):
#
# Rest of the definitions from above
#
def sum( self ):
return sum(self)
That looks really, really strange, but it works. Remember that def sum is the definition of a method function belonging to NumberList, with a parameter of self.
The body of that method function evaluates sum(self), which will apply the build-in sum() function to a list object, self.
How does Python know to use the built-in sum() function? Easy. The name (sum) doesn’t have a qualifier. THe name self.sum is the function being defined. The name sum is some global or built-in name outside the class definition.
A mapping is a collection of items, each of which is identified by an object that is the key. Each key object must provide a fixed value. A dictionary uses the hash() function to get a fixed numeric value from a key object.
In Special Behavior Requires Special Methods we noted that the hash() function works by calling the __hash__() method function of an object. Strings, tuples and numbers all produce consistent hash values. Lists and dictionaries, however, are mutable and can’t produce consistent hash values, which means they can’t be used as keys for mappings.
Mappings are collections, however, they provide slightly different forms of __setitem__(), __getitem__() and __delitem__(). With a sequence, the key is an integer position or slice. For a mapping, the key value will be an immutable object.
Mapping types should provide the following method functions.
key in self.
Return True if the key key occurs in this mapping, otherwise return False.
We may want to also define has_key(); it does the same thing, but is an explicit method function instead of the in operator.
Example of a mapping with Extra Features. You could, for example, create a new dictionary that will simplify developing a frequency table and determining the mode of a set of values.
Note that a list simply collects all the values. Instead, we’d like to create a mapping which has the unique objects as the keys, and the frequency of each object as the value. This will be a more compact way of maintaining large sets of data with relatively few distinct values.
Accumulating frequency tables using built-in dict has a complexity. First we must determine if an object is a key in the dictionary; if it exists, we can update the value, which is a count of occurrences. If, on the other hand, the object is not in the dictionary, we have to add it to the dictionary.
Doing frequency counts with a dictionary often looks like this, which is more programming than the problem really deserves.
oldCount= frequencyTable.setdefault( aKey, 0 )
frequencyTable[aKey]= oldCount + 1
Using* defaultdict. The collections module has defaultdict, which makes it much easier to create frequency counts. The defaultdict has a special version of __getitem__() that handles the missing key situation for us. If the key does not exist, defaultdict calls a supplied function to create a value for that key.
We can use it like this.
from collections import defaultdict
frequencyTable = defaultdict( int )
frequencyTable[aKey] += 1
We’ve avoided the little setdefault() dance.
Creating a Mapping as a Subclass of defaultdict. We’ll create a new type of mapping by extending collections.defaultdict with a mode() method.
We’ll also touch up the __init__() method to use the int() function as the default value for the dictionary. This will assure that all items in the dictionary have a default value of 0.
class FQTableMode( defaultdict ):
def __init__( self ):
super( FQTableMode, self ).__init__( int )
def mode( self ):
"""Compute mode by sorting frequency table into order by frequency."""
def byValue( item ): return item[1]
key, value = sorted( self.items(), key=byValue )[-1]
return key
Using FQTableMode. We can use our new class as follows:
Using the Dictionary Subclass
import die
# Definition of FQTableMode from above.
d= die.Dice()
frequency= FQTableMode()
for i in range(1000):
d1,d2= d.roll()
frequency[ d1+d2 ] += 1
print "mode", frequency.mode(), frequency[frequency.mode()]
Sequence with Statistics.
Create a sequence class, Statistics can hold a sequence of data values. This class should define all of the usual sequence operators, including __add__(), __radd__(), __iadd__(), but not __mul__(). The __init__() function should accept a sequence to initialize the collection.
The various __add__() functions should append the values from a Statistics instance as well as from ordinary sequences like list and tuple. Since objects of list and tuple respond to the iterator in a for statement, there should be no difference between handling an object of class Statistics, list or tuple.
Most importantly, this class should define the usual statistical functions like mean and standard deviation, described in the exercises after Doubles, Triples, Quadruples : The tuple and Statistics Library in Defining New Objects.
Since this class can be used everywhere a sequence is used, the look of a program should not change, but extra features are now readily available. For a test, something like the following should be used:
import random
samples = Statistics( [ random.randrange(6) for i in range(100) ] )
print samples.mean()
s2= Statistics()
for i in range(100):
s.append( random.randrange(6) )
# Allow s += [ random.randrange(6) ] , also
print s2.mean()
Mapping with Statistics.
Create a mapping class, FrequencyTable that holds a collection of data values and their frequency count. The value in the mapping is always a number, greater than or equal to zero.
This class should support all of the usual mapping operators, including __getitem__(), __setitem__(), setdefault(). The __init__() function should accept a sequence to initialize the mapping. This support can be through inheritance or by implementing the operators from scratch.
This class must also compute other statistics like mean and standard deviation. Note that the descriptions in the exercises after Doubles, Triples, Quadruples : The tuple and Statistics Library in Defining New Objects don’t cover the algorithms for doing statistics from frequency tables. The change isn’t too tricky: instead of simply summing the keys, you sum the key times the value. We’ll leave it at that because only serious statisticians will want to pursue this variation on the exercise.
Stocks with Statistics.
Create a mapping class, StockPortfolio that holds a collection of company ticker ID’s and blocks of each company. The value in the mapping can be a list.
The essential definition can be:
class Position( defaultdict ):
def __init__( self ):
super( Portfolio, self ).__init__( list )
You should be able to use this class as follows.
p = Position()
p['EK'].append( ShareBlock( purchDate='25-Jan-2001', purchPrice=35.86, shares=22 ) )
p['EK'].append( ShareBlock( purchDate='25-Apr-2001', purchPrice=37.66, shares=21 ) )
print p['EK']
Add a method to this Position to compute the total value for a given key.
Actually, it happens frequently. Much of data processing deals with collections of information. The basic set, sequence, mapping is fine for a number of common situations, but we often want additional features.
A list which includes sum and average, for example, happens so often that there are modules you can download which do this for you. The Python NumPy module is an example of these additional, sophisticated collections. Look at http://numpy.scipy.org/ for more information.
Actually, many problems involve collections. Let’s say you want to create a database with recipes. A recipe has two parts: a collection of ingredients and a collection of step for cooking. A recipe is an object with two specialized collections inside it.
A portfolio is a collection of stock positions, each one of which is more than a simple list of blocks. A stock position, has a purchase price, but it also has a current value, computed from the current trading price. The portfolio collection has lots of additional methods for evaluating the portfolio as a whole.