Learning Python: A Hands-On Approach

3.2 Edition

Steven F. Lott slott56@gmail.com

http://www.itmaybeahack.com/homepage/books/

http://www.itmaybeahack.com/book/python/python3_handson.html

Topics

  1. Install, Philosophy
  2. Code Kata
  3. Data Structures and Statements
  4. Functions
  5. Modules
  6. Conclusion

Install

http://www.python.org/

Python 2.x is already installed on most Linux distributions and on Mac OS X.

Add Python 3.2 (don't replace or remove Python 2.x)

Philosophy

All The Parts You Need

parts.jpg

Ways of Working

Read-Execute-Print Loop (REPL)

Scripts

IDE

Interactive Python

$ python3.2
Python 3.2 (r32:88452, Feb 20 2011, 10:19:59)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 355/113
3.1415929203539825

REPLoop:

Script Execution

$ python3.2 some_script.py

Alternate...

Put this #! line first

#!/usr/bin/env python3.2

Once the file is marked executable (chmod +x some_script.py), Unix shells will run it directly, too

$ ./some_script.py

Integrated Development Environment

Python comes with IDLE as an IDE.

Komodo Edit is particularly good.

Any programming text editor should handle Python syntax coloring.

There are a lot of IDEs.

Code Kata

Simple scraping of the Apache log from a web server.

Find out some basic facts about who's reading my books.

Each monthly log is ~15 MB.

Data Model

Apache log in "Common Log Format" or "Combined Log Format"

http://httpd.apache.org/docs/2.0/logs.html

Here's the format string

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

Nine fields...

Fields

Request field needs further parsing into "method URL Protocol".
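
A minimal sketch of that further parsing, assuming the three parts are always separated by whitespace. The variable names are illustrative, not part of the log format.

```python
# The request field ("%r") taken from one log line.
request = "GET /homepage/books/python/html/preface.html HTTP/1.1"

# Splitting on whitespace yields the three parts.
method, url, protocol = request.split()

print(method)    # GET
print(url)       # /homepage/books/python/html/preface.html
print(protocol)  # HTTP/1.1
```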

Example Log Line

41.191.203.2 - - [01/Feb/2012:03:27:04 -0500]
"GET /homepage/books/python/html/preface.html HTTP/1.1" 200 33322
"http://www.itmaybeahack.com/homepage/books/python/html/index.html"
"Mozilla/5.0 (Windows NT 6.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

(Wrapped in a failed attempt to fit the screen.)

Enough Background

Let's Code.

We'll assume that the logs have been downloaded and unzipped.

http://www.itmaybeahack.com/book/python/itmaybeahack.com.bkup-Mar-2012

It's not that download and unzip is difficult to code in Python.

Part 2

We'll look at Python Data Structures

and Statements

that help parse the Apache Combined Log Format.

Data Structures

Python has Files, Strings, Tuples, Lists, Mappings, Sets, etc.

Open and Read a File

>>> path = "../../Work/ItMayBeAHack/itmaybeahack.com.bkup-Mar-2012"
>>> source = open(path, 'r')

Gives us a file object, open and ready to read.

>>> source.read()
Spew
>>> source.close()

We'll need to do better.

Files are Sequences

The Python file class shares the iteration protocol with sequences.

It's iterable: it works with the for statement.

>>> source = open(path,'r')
>>> for line in source:
...    print( line )
...

32,000 lines of Spew

That's a start.

Statements We've Seen

Assignment (variable = expression)

Expression (file.read())

for Loop (for variable in iterable:)

The for statement is a "compound" statement.

It has an indented block.

REPL Interaction

Simple statements are complete on one line.

Compound statements (like for) have an indented block.

Indentation is significant.

Outdenting ends a compound statement.

String Objects

>>> line
'157.55.18.26 - - [01/Mar/2012:22:07:15 -0500] "GET /homepage/books/oodesign/build-python/html/roulette/player.html HTTP/1.1" 200 39296 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"\n'

How do we break up a string object into individual fields?

String Splitter

>>> fields= line.split()
>>> fields
['157.55.18.26', '-', '-', '[01/Mar/2012:22:07:15', '-0500]', '"GET', '/homepage/books/oodesign/build-python/html/roulette/player.html', 'HTTP/1.1"', '200', '39296', '"-"', '"Mozilla/5.0', '(compatible;', 'bingbot/2.0;', '+http://www.bing.com/bingbot.htm)"']
>>> fields[3]
'[01/Mar/2012:22:07:15'
>>> fields[4]
'-0500]'
>>> len(fields)
15

Regular Expression Digression

import re
format_pat= re.compile(
    r"([\d\.]+)\s+" # digits and .'s: host
    r"(\S+)\s+"     # non-space: logname
    r"(\S+)\s+"     # non-space: user
    r"\[(.+?)\]\s+" # Everything in []: time
    r'"(.+?)"\s+'   # Everything in "": request
    r"(\d+)\s+"     # digits: status
    r"(\S+)\s+"     # non-space: bytes
    r'"(.*?)"\s+'   # Everything in "": referrer
    r'"(.*?)"\s*'   # Everything in "": user agent
)

RE Notes

import re
format_pat= re.compile(
    r"([\d\.]+)\s+" # digits and .'s: host
    ...
    r'"(.*?)"\s*'   # Everything in "": user agent
)
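
The pattern leans on three things: raw strings (r"..."), adjacent string literals that Python joins into one pattern, and non-greedy matching (+? and *?). A small sketch of the last two, using a made-up, shortened log fragment rather than the full format:

```python
import re

# Adjacent string literals are joined into one pattern at compile time,
# which leaves room for a comment after each piece.
pat = re.compile(
    r"\[(.+?)\]\s+"  # non-greedy: stop at the first ]
    r'"(.+?)"'       # non-greedy: stop at the first closing quote
)

# A made-up fragment of a log line, for illustration only.
sample = '[01/Feb/2012:03:27:04 -0500] "GET /index.html HTTP/1.1" 200 "-"'

match = pat.match(sample)
print(match.groups())
# ('01/Feb/2012:03:27:04 -0500', 'GET /index.html HTTP/1.1')

# With a greedy .+ the second group runs on to the last quote.
greedy = re.compile(r'\[(.+?)\]\s+"(.+)"')
print(greedy.match(sample).group(2))
# GET /index.html HTTP/1.1" 200 "-
```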

Parsing Each Line

>>> format_pat= re.compile(...) # details omitted
>>> format_pat
<_sre.SRE_Pattern object at 0x100456830>
>>> format_pat.match(line)
<_sre.SRE_Match object at 0x5f2320>

Cool. Now we can use the match object.

Parsing Each Line

>>> match= format_pat.match(line)
>>> match.groups()
('190.13.37.18', '-', '-', '04/Feb/2012:22:03:51 -0500',
'GET /homepage/_static/spiral.ico HTTP/1.1', '200',
'894', '-',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.14 (KHTML, like Gecko) Chrome/18.0.972.0
Safari/535.14 SUSE/18.0.972.0')

Sequence (list and tuple) Access

>>> g = match.groups()
>>> g[0]
'190.13.37.18'
>>> g[3]
'04/Feb/2012:22:03:51 -0500'
>>> g[4]
'GET /homepage/_static/spiral.ico HTTP/1.1'

>>> g[5:7]
('200', '894')

>>> g[-2:]
('-', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.14 (KHTML, like Gecko) Chrome/18.0.972.0 Safari/535.14 SUSE/18.0.972.0')

Slicing a Subset of a Sequence

What's up with this?

>>> g[5:7]
('200', '894')

Negative Index?

>>> g[-2:]
('-', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.14 (KHTML, like Gecko) Chrome/18.0.972.0 Safari/535.14 SUSE/18.0.972.0')
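
A short sketch of both ideas, using a stand-in for the groups tuple (the user agent value is abbreviated here):

```python
# A stand-in for the tuple returned by match.groups(); values are shortened.
g = ('190.13.37.18', '-', '-', '04/Feb/2012:22:03:51 -0500',
     'GET /homepage/_static/spiral.ico HTTP/1.1', '200', '894', '-',
     'Mozilla/5.0')

# g[start:stop] includes position start and excludes position stop.
print(g[5:7])   # ('200', '894')

# Negative positions count from the end; an omitted stop means "to the end".
print(g[-1])    # Mozilla/5.0
print(g[-2:])   # ('-', 'Mozilla/5.0')

# An omitted start means "from the beginning".
print(g[:3])    # ('190.13.37.18', '-', '-')
```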

Putting it together

>>> source = open(path,'r')
>>> for line in source:
...     match= format_pat.match(line)
...     print( match.groups()[0] )
...

Spews 32,000 IP addresses.

Now to count them.

Mappings

We want a mapping from a key to value.

Mapping Example

Like this:

>>> counts= { '162.53.28.67': 14, '81.84.242.72': 15 }
>>> host= '162.53.28.67'
>>> counts[host]
14
>>> counts[host] += 1
>>> counts[host]
15

What about Key Not Found?

Ordinary dictionaries raise exceptions when a key is not found.

>>> host= '1.2.3.4'
>>> counts[host] += 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '1.2.3.4'

Not fun for accumulating a frequency count.

The collections module, however, has defaultdict.

Using defaultdict

>>> from collections import defaultdict

>>> counts= defaultdict(int)
>>> host= '1.2.3.4'
>>> counts[host] += 1
>>> counts
defaultdict(<class 'int'>, {'1.2.3.4': 1})

No extra if statement required to check for key in dictionary.

Much nicer.
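
For comparison, a sketch of the same count done three ways on made-up host data: a plain dict with the extra if statement, defaultdict, and collections.Counter (also in the standard library).

```python
from collections import Counter, defaultdict

hosts = ['1.2.3.4', '162.53.28.67', '1.2.3.4']  # made-up sample data

# Plain dict: every update needs a membership test first.
plain = {}
for host in hosts:
    if host not in plain:
        plain[host] = 0
    plain[host] += 1

# defaultdict(int): a missing key starts at int(), which is 0.
counts = defaultdict(int)
for host in hosts:
    counts[host] += 1

# Counter does the same bookkeeping in one call.
tally = Counter(hosts)

print(plain == dict(counts) == dict(tally))  # True
print(counts['1.2.3.4'])                     # 2
```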

Immutability

Some classes of objects are immutable.

Some classes of objects are mutable.

Immutable objects never change their value.

Mutable objects can be modified "in place".
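
A small sketch of the difference, using built-in types only:

```python
# Strings and tuples are immutable: methods return new objects.
name = "python"
upper = name.upper()
print(name, upper)      # python PYTHON

# Lists and dicts are mutable: methods change the object in place.
hosts = ['1.2.3.4']
alias = hosts           # a second name for the same list object
alias.append('5.6.7.8')
print(hosts)            # ['1.2.3.4', '5.6.7.8']

# Trying to change an immutable object raises TypeError.
point = (1, 2)
try:
    point[0] = 99
except TypeError:
    print("tuple unchanged:", point)   # tuple unchanged: (1, 2)
```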

One More Statement

It's polite to close files.

It's more fun to have them closed automatically.

with open(path, 'r') as source:
    for line in source:
        etc.

The with statement works with a context manager to ensure that the context (i.e., the open file) is cleaned up.

Files are context managers. They're closed nicely by with.
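
A minimal sketch of what with saves us from writing, assuming a throwaway temporary file in place of the real log:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
handle, path = tempfile.mkstemp(text=True)
with os.fdopen(handle, 'w') as target:
    target.write("line one\nline two\n")

# Without with: close explicitly, even if the loop raises an exception.
source = open(path, 'r')
try:
    manual = [line.rstrip() for line in source]
finally:
    source.close()

# With with: the file is closed when the block ends.
with open(path, 'r') as source:
    automatic = [line.rstrip() for line in source]

print(source.closed)        # True
print(manual == automatic)  # True
os.remove(path)
```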

The Script

import re
from collections import defaultdict
format_pat = re.compile( r" details omitted " )
path = '../../Work/ItMayBeAHack/itmaybeahack.com.bkup-Mar-2012'
counts = defaultdict(int)
with open(path,'r') as source:
    for line in source:
        fields = format_pat.match(line).groups()
        counts[fields[0]] += 1
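
One caution: match() returns None for a line that doesn't fit the pattern, and calling .groups() on None raises AttributeError. A sketch of a more defensive loop follows. The count_hosts() helper and the sample lines are made up for illustration; the pattern is the one from the Regular Expression Digression.

```python
import re
from collections import defaultdict

format_pat = re.compile(
    r"([\d\.]+)\s+"  # host
    r"(\S+)\s+"      # logname
    r"(\S+)\s+"      # user
    r"\[(.+?)\]\s+"  # time
    r'"(.+?)"\s+'    # request
    r"(\d+)\s+"      # status
    r"(\S+)\s+"      # bytes
    r'"(.*?)"\s+'    # referrer
    r'"(.*?)"\s*'    # user agent
)

def count_hosts(lines):
    """Hypothetical helper: count requests per host, skipping bad lines."""
    counts = defaultdict(int)
    skipped = 0
    for line in lines:
        match = format_pat.match(line)
        if match is None:   # the line does not fit the log format
            skipped += 1
            continue
        counts[match.group(1)] += 1
    return counts, skipped

# Made-up sample: one well-formed line, one line of junk.
sample = [
    '157.55.18.26 - - [01/Mar/2012:22:07:15 -0500] '
    '"GET /a.html HTTP/1.1" 200 39296 "-" "bingbot/2.0"\n',
    'not a log line\n',
]
counts, skipped = count_hosts(sample)
print(dict(counts), skipped)   # {'157.55.18.26': 1} 1
```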

What Do We Have?

>>> len(counts)
2594
>>> max( counts.values() )
1648
>>> sum( counts.values() )
50906

2,594 distinct IP addresses

1,648 requests from a single address (in a month!). Creepy.

50,906 requests overall

To Do

It's three more lines of code to "invert" the mapping.

Create a new defaultdict of lists.

Sort the keys into descending order.

Like This

host_by_fq = defaultdict(list)
for k in counts:
    host_by_fq[counts[k]].append(k)

This new mapping has the frequency as the key.

The value is a list of strings: all the IP addresses with that frequency.

host_by_fq[1] will be all the one-time-only IP addresses.
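
A sketch of the sorting step, run against a made-up counts mapping rather than the real log data:

```python
from collections import defaultdict

# Made-up counts standing in for the real mapping.
counts = {'1.2.3.4': 1, '162.53.28.67': 15, '81.84.242.72': 15, '5.6.7.8': 1}

# Invert: frequency -> list of hosts with that frequency.
host_by_fq = defaultdict(list)
for k in counts:
    host_by_fq[counts[k]].append(k)

# Walk the frequencies from highest to lowest.
for fq in sorted(host_by_fq, reverse=True):
    print(fq, sorted(host_by_fq[fq]))
# 15 ['162.53.28.67', '81.84.242.72']
# 1 ['1.2.3.4', '5.6.7.8']
```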

This may be a bit advanced. Let's move on.

Statements

Compound Statements

And that's all there are. Keep it simple.

Part 3: Functions, Objects and Interfaces

objects.jpg

Built From Parts

deck.jpg interior.jpg

Functions

Two flavors of function.

Generator functions are a distinctive feature of Python.

Example Function

Map a request string to a section in the book/ path.

GET /book/python-2.6/html/p02/p02c03_tuples.html HTTP/1.1

def book_section( request ):
    """book_section(request) -> URL path."""
    method, req_path, protocol = request.split()
    path = req_path.split('/')
    if path[1] == "book":
        return path[2:]

Digression: Docstring

First line inside a function (or class or module) should be a string with documentation.

def book_section( request ):
    """book_section(request) -> URL path as list.
    Or returns None if request does not start
    with '/book'.

    :param request: the Apache log request field
    :returns: list with path from the URL or None.
    """

Uses reStructuredText (RST) markup.

Documentation from Docstrings

In REPL

>>> help( book_section )

Also, format the docstrings with tools

Using a Function

>>> some_req
'GET /book/python-2.6/html/p02/p02c03_tuples.html HTTP/1.1'
>>> book_section(some_req)
['python-2.6', 'html', 'p02', 'p02c03_tuples.html']

>>> another_req
'GET /homepage/books/oodesign/build-python/html/roulette/player.html HTTP/1.1'
>>> book_section(another_req)
>>>

No return value? Actually, the return value is None, and the REPL prints nothing for None.

Function Return Values

Explicit

Implicit
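
A small sketch of the two cases. The function names are made up for illustration.

```python
def explicit(value):
    """Every path ends in a return statement with a value."""
    if value > 0:
        return "positive"
    return "not positive"

def implicit(value):
    """Falling off the end of the function returns None."""
    if value > 0:
        return "positive"

print(explicit(-1))   # not positive
print(implicit(-1))   # None
```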

More Function Goodness

Keyword argument values.

import sys
print("words", file=sys.stderr)

Makes it more clear what's going on.

Handy for optional parameters.

Default/Optional Parameters

def section( request, root_name="book" ):
    """section(request, root_name) -> URL path."""
    method, req_path, protocol = request.split()
    path = req_path.split('/')
    if path[1] == root_name:
        return path[2:]

Default value (root_name="book")

Using Default Parameters

book_section = section( request )
book_section = section( request, "book" )
blog_section = section( request, "web" )

Default Value Rule

Warning

Do Not Use Mutable Objects as Defaults

You have been warned.

Do Not Do This

def some_function( positional, another=[] ):
    # some code
    another.append( positional )
    # maybe more code

The default value for another, [], is created just once.

That single mutable list object will be reused on every call.

Avoid default values that are list, dict or set.

Might not do what you wanted.
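
A sketch of the surprise, followed by the usual workaround of a None default. The function names are made up for illustration.

```python
def broken(item, bucket=[]):
    """The default list is created once, when the def statement runs."""
    bucket.append(item)
    return bucket

print(broken(1))   # [1]
print(broken(2))   # [1, 2]  (the same list as before)

def fixed(item, bucket=None):
    """Use None as the default and build a fresh list on each call."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(fixed(1))    # [1]
print(fixed(2))    # [2]
```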

Generator Functions

The yield statement makes a profound change to a function.

Calling the function returns a generator object, which is iterable.
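
A tiny sketch of the idea, using a made-up countdown() generator unrelated to the log data:

```python
def countdown(n):
    """Hypothetical generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n     # hand back one value, then pause here
        n -= 1

gen = countdown(3)
print(next(gen))    # 3  (runs only as far as the first yield)
print(list(gen))    # [2, 1]  (the remaining values)

# Most often a generator is simply used in a for statement.
for value in countdown(2):
    print(value)    # 2, then 1
```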

Cleaned Lines Generator

def cleaned_lines( some_source ):
    for line in some_source:
        fields = format_pat.match(line).groups()
        method, uri, protocol = fields[4].split()
        yield fields[0], fields[3], uri.split('/')[1:]

How We Use A Generator

>>> with open(path,'r') as source:
...     for host, time, uri in cleaned_lines(source):
...         print( host, time, uri )
...
157.55.18.26 01/Mar/2012:22:07:15 -0500 ['homepage', 'books', 'oodesign', 'build-python', 'html', 'roulette', 'player.html']

Part 4

Look briefly at the Python library of Modules

See http://xkcd.com/353/

Or, better yet,

>>> import antigravity

It does work.

Modules

Simple Script

#!/usr/bin/env python3.2
"""Docstring"""
import datetime
import re
format_pat= re.compile( r" details omitted " )
path= "itmaybeahack.com.bkup-Mar-2012"
with open(path,'r') as source:
    for line in source:
        match= format_pat.match(line)
        etc.

Only the real work. But. It's not very reusable.

Define Then Reuse

#!/usr/bin/env python3.2
"""Docstring"""
import datetime
import re
format_pat= re.compile( r" details omitted " )
def analyze( path ):
    etc.

def main():
    analyze( "itmaybeahack.com.bkup-Mar-2012" )

main()

Library Module -- Encourages Reuse

#!/usr/bin/env python3.2
"""Docstring"""
import datetime
import re
class Access:
    etc.
def analyze( path ):
    etc.

Main-Import Switch

if __name__ == "__main__":
    main()
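
A sketch of how the switch fits into a whole module. The body of analyze() is a placeholder, not the real analysis.

```python
def analyze(path):
    """Placeholder for the real analysis; returns a description."""
    return "analyzing " + path

def main():
    print(analyze("itmaybeahack.com.bkup-Mar-2012"))

# __name__ is "__main__" only when this file is run as a script.
# When another module imports it, __name__ is the module's own name,
# so main() is not called and only the definitions are loaded.
if __name__ == "__main__":
    main()
```

Run directly, the file does the work. Imported, it just offers analyze() for reuse.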

Python Library

There are hundreds of modules. Hundreds.

Learn the library; it will save tons of work. Tons.

It's probably already been done. Don't reinvent.

Decimal

Don't use float for currency:

>>> 2.35*1.07
2.5145000000000004

Use decimal:

>>> from decimal import Decimal
>>> Decimal( '2.35' )
Decimal('2.35')
>>> Decimal('2.35')*Decimal('1.07')
Decimal('2.5145')
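
Currency usually needs rounding to cents as well. A sketch using quantize(), assuming half-up rounding is the rule wanted (the rate is an illustrative value):

```python
from decimal import Decimal, ROUND_HALF_UP

price = Decimal('2.35')
rate = Decimal('1.07')   # illustrative multiplier

total = price * rate
print(total)             # 2.5145

# Round to whole cents with an explicit rounding rule.
cents = total.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
print(cents)             # 2.51
```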

Conclusion

What have we seen?

Application Areas?

Nearly Infinite. We started with log scraping because it's easy and simple.

Yes, but

It's interpreted.

Isn't that slow?

Potentially.

If it's a problem, write an extension module in C++.

Only optimize the 20% that's slow.

Leave the 80% in Python.

Other Architectures

Nearly Infinite.

Desktop GUI? PyGTK or PyQt or wxPython or ...

Web Sites? Django or Werkzeug or web2py or ...

Questions?