The re module is the core of text processing in Python. It provides sophisticated ways to create and use regular expressions. A regular expression is a kind of formula that specifies patterns in text strings.
The name “regular expression” comes from an earlier mathematical treatment of “regular sets”. For our purposes, the set theory will be boiled away. We’re still stuck with the phrase.
Regular expressions give us a simple way to specify a set of related strings by describing the pattern they have in common. We write a pattern to summarize some set of matching strings. This pattern string can be compiled into an object that efficiently determines if and where a given string matches the pattern.
For example, the pattern ab.* describes the set of words that includes “abacterial”, “abandoners” and “abandoning”, among many others. Looking at it from the other direction, the string “abashments” matches the pattern, whereas the string “academical” does not.
In this chapter, we’ll look at why this pattern matching even matters in How Does Pattern Matching Help Us? Then we’ll look at how we write these regular expression patterns in How To Create Patterns Using Regular Expressions. Once we have the pattern, we’ll look at how we use it in Objects We Use For Pattern Matching and Parsing.
We’ll look at a big example in Some Examples.
There are a number of common problems that we have to solve when processing strings. When we get strings as input from files, as a response to raw_input(), or from a GUI, we often need to look at the string as meaningful groups of characters, not as individual characters.
Since a str is a sequence, we’re limited to doing things with single characters or simple slices. Without a lot of fancy footwork, we’re limited to simple contiguous substrings at fixed positions.
What if we need some flexibility? It is very helpful to allow people to type variable amounts of whitespace – spaces, tabs, etc. – in a file. Also, people like some flexibility in entering numbers. We don’t want to force them to type “03/08/1987” with those silly-looking ‘0’s in front. We’d like to accept “03/08/1987” as gracefully as “3/8/1987”.
Processing text can take one of three common forms: matching, searching and parsing.
For example, a file may contain lines like "Birth Date: 3/8/87" or "Birth Date: 12/02/87". When we’re reading lines like these from the file, we may do any of the following.
We can accomplish these matching, searching and parsing operations with the re module in Python. A regular expression (RE) is a rule or pattern used for matching, searching and parsing strings.
The Filename Wildcard. The fairly simple “wild-card” filename matching rules are kinds of regular expressions, also. These rules are embodied in two packages that we looked at in Additional File-Related Modules: fnmatch and glob.
The filename patterns in fnmatch and glob use a few special characters that don’t have their usual literal meaning. When we write a glob pattern, most characters simply match themselves. However, the * character matches any sequence of characters in a file name, and the ? character matches any single character in a file name.
The re module provides considerably more sophisticated pattern matching capabilities than these simple rules. It uses the same principle: some punctuation marks have special meanings as part of pattern specification.
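For comparison, here is a small sketch of the wildcard rules described above, using the fnmatch module. The file names are made up for illustration.

```python
import fnmatch

# Hypothetical file names, invented for this illustration.
names = ["report.txt", "notes.txt", "data1.csv", "data2.csv", "data10.csv"]

# * matches any sequence of characters in the name.
text_files = fnmatch.filter(names, "*.txt")

# ? matches exactly one character, so "data10.csv" is left out.
single_digit = fnmatch.filter(names, "data?.csv")

print(text_files)
print(single_digit)
```

Note that the * here stands on its own; it does not modify the character in front of it, which is how the re module treats it.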
File Searching. An example program which does this is called grep. This is a GNU/Linux application program; the name means Global Regular Expression Print. (Windows users may be familiar with the findstr DOS command, which does approximately the same thing.)
The grep (or findstr) program reads one or more files, searches for lines that match a given regular expression and prints the matching lines.
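The idea behind grep can be sketched in a few lines of Python. This is only an illustration, not how grep itself is written; the function name grep and the sample lines are assumptions made for this example.

```python
import re

def grep(pattern_text, lines):
    """Return the lines that contain a match for the pattern (illustrative helper)."""
    pattern = re.compile(pattern_text)
    # search() looks for the pattern anywhere in the line.
    return [line for line in lines if pattern.search(line)]

# Made-up sample input.
sample = [
    "Name: Alice",
    "Birth Date: 3/8/87",
    "Favorite color: blue",
    "Birth Date: 12/02/87",
]

found = grep(r"Birth Date", sample)
for line in found:
    print(line)
```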
Using Regular Expressions. The general recipe for using regular expressions in your program is the following.
Be sure to import re.
Define the pattern string. We write patterns as string constants in our program.
Evaluate the re.compile() function to create a re.Pattern object. This re.compile() function is a factory that creates usable pattern objects from our original pattern strings. The pattern object will do the real work of matching a target string against your regular expression.
Usually we combine the pattern and the compile.
>>> date_pattern = re.compile( "Birth Date: +(.*)" )
Use the re.Pattern object to match or search the candidate strings. The result of a successful match or search will be a re.Match object. In a sense, the Pattern object is a factory that creates Match objects from string input.
>>> match = date_pattern.match( "Should Not Match" )
>>> match
>>> match = date_pattern.match( "Birth Date: 3/8/87" )
>>> match
<_sre.SRE_Match object at 0x82e60>
When a string doesn’t match the pattern, the pattern object’s match() method returns None, which is treated as False in a condition.
A successful match creates a re.Match object, and a Match object is always treated as True. We can use the match object in an if statement.
Use the information in the re.Match object to parse the string. The match object is only created for a successful match, and provides the details to help us work with the original string.
>>> match.group()
'Birth Date: 3/8/87'
>>> match.group(1)
'3/8/87'
>>> match.groups()
('3/8/87',)
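Put together as a script rather than an interactive session, the recipe above might look like the following sketch. The function name birth_date is hypothetical; the pattern is the one used in the session.

```python
import re

# Import, define the pattern string, and compile it in one step.
date_pattern = re.compile(r"Birth Date: +(.*)")

def birth_date(line):
    """Return the text after 'Birth Date:', or None if the line doesn't match."""
    match = date_pattern.match(line)   # match a candidate string
    if match:                          # None counts as false
        return match.group(1)          # pull out the captured group
    return None

print(birth_date("Birth Date: 3/8/87"))
print(birth_date("Should Not Match"))
```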
Pattern as Production Rule. One way to look at a regular expression is as a production rule for constructing strings. You can think of the pattern as the rule producing a giant collection of all possible strings.
When you use the pattern to match a target, you’re looking for your target string in that giant set of possibilities.
As a practical matter, the Regular Expression module doesn’t actually enumerate all of the strings that a pattern describes. The set of possible strings could be infinite.
Pragmatically, the match algorithm looks at each clause of your regular expression pattern and locates the matching characters in the candidate string. If the next character in the string matches the next clause in the regular expression rule, the algorithm goes forward. When the algorithm runs out of clauses in the pattern, it has found a match.
In many cases, a clause in the pattern will have alternatives. In this case, the algorithm places bookmarks in the target string and pattern at each alternative choice. If the next character in the target string doesn’t match the next clause in the pattern, then the algorithm backtracks and tries a different choice in the pattern. In this way, the matching tries out the various alternatives in the pattern, looking for some way to match the entire pattern against the string.
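The effect of backtracking can be observed from the outside, even though we can't watch the algorithm work. In this sketch the pattern and candidate string are made up to force one backtracking step.

```python
import re

# Two alternatives for the first clause: "ab" is tried first, then "a".
pattern = re.compile(r"(ab|a)bc")

match = pattern.match("abc")

# "ab" matches at first, but then "bc" cannot match the remaining "c".
# The matcher backtracks and retries the first clause with the alternative "a".
first_clause = match.group(1)
print(first_clause, match.group())
```

The captured group shows which alternative finally let the whole pattern match.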
For example, a Regular Expression pattern could be "aba". This production rule describes a string created from a, followed by b, followed by a. This simple rule only builds one possible string; consequently, the pattern’s match() method succeeds only for candidate strings that begin with the exact sequence "aba".
A more complex RE pattern could be "ab*a". This production rule describes a string created from a, followed by any number of b’s, followed by a. This describes an infinite set of strings including "aa", "aba", "abba", etc. Note that the phrase “any number of” includes zero. That’s why "aa" matches: it has zero b’s.
Note that the * character means “repeat the previous RE”. This is different from the way fnmatch works. We’ll explore the special characters in the re module in detail in the next section.
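The difference between the two meanings of * can be checked directly. This sketch uses re.fullmatch-style matching (the whole candidate must fit) and fnmatch.fnmatchcase(); the candidate strings are chosen for illustration.

```python
import re
import fnmatch

pattern = re.compile(r"ab*a")
candidates = ["aa", "aba", "abba", "abca"]

# In re, * repeats the previous RE: zero or more b's between two a's.
re_results = [c for c in candidates if pattern.fullmatch(c)]

# In fnmatch, * stands alone and means "any sequence of characters".
glob_results = [c for c in candidates if fnmatch.fnmatchcase(c, "ab*a")]

print(re_results)
print(glob_results)
```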
We’ll cover the basics of creating and using RE’s in this section. The full set of rules is given in section 4.2.1 of the Python Library Reference document [PythonLib]. This is a deep subject, and you can find several books that will help you unlock the secrets of regular expressions.
To understand the rules, we have to make a distinction between ordinary characters and special characters. Most characters, like letters and numbers, are ordinary: they match what they appear to mean. For example, an x in a pattern matches the letter x, nothing more.
Some characters, however, have special meanings. Mostly these are punctuation marks; they don’t match a character, but they are a pattern or a modification to a previous pattern. For example, a . in a pattern doesn’t match the period character, it matches any single character.
But what if we want to match a .? We must escape that special meaning by using a \ in front of the character. For example, \. escapes the special meaning that . normally has; it creates a single-character RE that matches only the character ..
Additionally, some ordinary characters can be made special with the escape character, \. For instance, \d does not match d, it matches any digit; \s does not match s, it matches any whitespace character.
Any ordinary character, by itself, is a RE. Example: "a" is a RE that matches the character a in the candidate string. While trivial, it is critical to know that each ordinary character is a stand-alone RE. A special character is an RE when it is escaped with \. For example, . and * are special characters, but \. and \* are simple one-character RE’s.
The special character . is a RE that matches any single character. Example: "x.z" is a RE that matches the strings like "xaz" or "x9z", but doesn’t match strings like "xabz" or "xz".
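A short sketch contrasting the special . with the escaped \. may help. The candidate strings come from the examples above, plus "x.z" itself.

```python
import re

any_char = re.compile(r"x.z")      # . matches any single character
literal_dot = re.compile(r"x\.z")  # \. matches only a period

candidates = ["xaz", "x9z", "x.z", "xabz", "xz"]

any_hits = [c for c in candidates if any_char.fullmatch(c)]
dot_hits = [c for c in candidates if literal_dot.fullmatch(c)]

print(any_hits)
print(dot_hits)
```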
The special characters [ ] create a RE that matches any one of the characters in a set defined by the characters in the [ ]’s. Example: "x[abc]z" matches any of "xaz", "xbz" or "xcz".
A range of characters can be specified using a -. The character before and after the - must be in proper order. For example "x[1-9]z".
Multiple ranges are allowed, for example "x[A-Za-z]z".
Here’s a common RE that matches a letter followed by a letter, digit or _: "[A-Za-z][A-Za-z0-9_]".
To include a -, it must be the first or last character in the [ ]’s. If - is not first or last, then it indicates a range of characters.
To include a ^, it must not be the first character in the [ ]’s. If ^ is first, it modifies the meaning of the [ ]’s.
Some common sets of characters have shorter names. [0-9] can be abbreviated \d (d for digits). [ \t\n\r\f\v] can be abbreviated \s (s for space). [a-zA-Z0-9_] can be abbreviated \w (w for word).
The special character ^ modifies the brackets, [^...]. This creates an RE that matches any character except those between the [ ]’s. Example: "a[^xyz]b" matches strings like "a9b" and "a$b", but doesn’t match "axb". As with [ ], a range can be specified and multiple ranges can be specified.
To include a -, it must be the first or last character in the [ ]’s.
Some common sets of characters have shorter names. \D (D for non-digits) is the same as [^0-9], the opposite of \d. \S (S for non-space) is the same as [^ \t\n\r\f\v], the opposite of \s. \W (W for non-word) is the same as [^a-zA-Z0-9_], the opposite of \w.
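The set rules above can be tried out with a few candidate strings. This is a sketch; the helper name fits is hypothetical, and it requires the whole candidate to match.

```python
import re

def fits(pattern_text, candidate):
    """True when the whole candidate matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern_text, candidate) is not None

# A set, a range, and a negated set.
set_hits = [s for s in ["xaz", "xbz", "xdz"] if fits(r"x[abc]z", s)]
range_hits = [s for s in ["x1z", "x9z", "x0z"] if fits(r"x[1-9]z", s)]
negated_hits = [s for s in ["a9b", "a$b", "axb"] if fits(r"a[^xyz]b", s)]

# \d is the short name for a digit, \D for anything that is not a digit.
digits_ok = fits(r"\d+", "1987")
non_digits_ok = fits(r"\D+", "1987")

print(set_hits, range_hits, negated_hits, digits_ok, non_digits_ok)
```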
An RE can be formed from concatenating RE’s. Example: "a.b" is three regular expressions, the first matches a, the second matches any character, the third matches b. While this may seem obvious, it’s a necessary rule that helps us figure out which RE’s are modified by the * or | operators.
This is perhaps the most important rule for defining regular expressions. This rule tells us that we can put any number of one-part regular expressions together in a sequence to make a new, longer RE.
Note that there’s no special character that puts RE’s together; the sequence of RE’s is implied. This is similar to the way mathematicians imply multiplication by writing symbols next to each other. For example, 2πr means 2 × π × r.
An RE can be a group of RE’s with ( ). This creates a single RE that is composed of multiple parts. Also, this defines parts of the string that will be captured by the Match object.
Example: "(ab)c" is a regular expression composed of two regular expressions: "(ab)" (which, in turn, is composed of two RE’s) and "c". This matches the string "abc". This grouping is used with the repetition operators (*, +, ?) shown below, and the alternative operator, |.
( ) also identify RE’s for parsing purposes. The elements matched within ( ) are remembered by the regular expression processor and set aside in the resulting Match object. By saving matched characters, we can decompose a string into useful groups.
An RE can be repeated using *, + or ?. Several repeat constructs are available: "x*" repeats x zero or more times; "y+" repeats y one or more times; "z?" repeats z zero times or once, which makes the previous RE optional.
Example: "1(abc)*2" matches "12" (zero copies of abc) or "1abc2" or "1abcabc2", etc. Since the (abc) part of this pattern uses ()’s, the sequence of expressions is repeated as a whole. The first match, against "12", is often surprising; but there are zero copies of abc between 1 and 2.
Example: "1[abc]*2" matches "12" or "1a2" or "1b2" or "1abacab2", etc. Since the [abc] part of this pattern uses [ ]’s, any one of the characters in the [ ]’s will match. The first match, against "12", is often surprising; but there are zero instances of the characters abc between 1 and 2.
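The two repetition examples can be compared side by side. This sketch uses the strings from the examples plus one that neither pattern accepts.

```python
import re

group_repeat = re.compile(r"1(abc)*2")   # repeats the sequence abc as a whole
set_repeat = re.compile(r"1[abc]*2")     # repeats any one of a, b or c

candidates = ["12", "1abc2", "1abcabc2", "1a2", "1abacab2", "1abd2"]

group_hits = [c for c in candidates if group_repeat.fullmatch(c)]
set_hits = [c for c in candidates if set_repeat.fullmatch(c)]

print(group_hits)
print(set_hits)
```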
Two RE’s are alternatives, using |. The alternative construct allows you to combine a number of different rules into a single pattern. For example, you might have two allowed forms for dates: mm/dd/yyyy or dd-mon-yyyy. You might write the following pattern: (\d+/\d+/\d+)|(\d+-\w+-\d+) to match either alternative.
The character ^ is an RE that only matches the beginning of the line.
The character $ is an RE that only matches the end of the line.
Example: "^$" matches a completely empty line.
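As a small illustration of the two anchors, this sketch finds the empty lines in a list of made-up lines.

```python
import re

blank = re.compile(r"^$")

# Made-up lines; the third contains two spaces, so it is not empty.
lines = ["first", "", "  ", "last"]

empty_positions = [i for i, line in enumerate(lines) if blank.match(line)]
print(empty_positions)
```

A line holding only spaces does not match; a pattern like "^\s*$" would be one way to accept it as well.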
Match Some Dates. Here’s an example of a pattern to match two different kinds of dates. We’ll use the re.compile() function to build a pattern from our original pattern specification string. Once we have this pattern, we’ll use it to match a number of candidate strings. We’ll examine the groups which matched successfully to see what happened.
>>> import re
>>> p = "(\d+/\d+/\d+)|(\d+-[a-zA-Z]+-\d+)"
>>> pat = re.compile( p )
>>> pat.match( "9/10/5" )
<_sre.SRE_Match object at 0x68d58>
>>> _.groups()
('9/10/5', None)
>>> pat.match( "10-sep-56" )
<_sre.SRE_Match object at 0x68d58>
>>> _.groups()
(None, '10-sep-56')
>>> pat.match( "hi mom" )
Match A Property File Line. This pattern matches the kind of line that is often found in a properties file or a configuration file.
There are six regular expressions:
Match a Time. Here’s a pattern to match various kinds of times with hh:mm:ss and hh:mm:ss.sss formats.
This pattern matches one or more digits with (\d+), a :, one or more digits, a :, and digits followed by an optional . and zero or more other digits. For example, "20:07:13.2" would match, as would "13:04:05". Further, the ()’s allow separating the digit strings for conversion and further processing. Again, the punctuation marks are quietly dropped, since we only want to process the numbers.
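The pattern string itself does not appear in the text here, so the sketch below uses a reconstruction based on the description above. Treat the exact pattern, and the helper name parse_time, as assumptions.

```python
import re

# Assumed reconstruction: digits, ":", digits, ":", digits with an optional fraction.
time_pat = re.compile(r"(\d+):(\d+):(\d+\.?\d*)")

def parse_time(text):
    """Return (hours, minutes, seconds) or None (hypothetical helper)."""
    m = time_pat.match(text)
    if m is None:
        return None
    hh, mm, ss = m.groups()   # the ":" marks are not part of any group
    return int(hh), int(mm), float(ss)

print(parse_time("20:07:13.2"))
print(parse_time("13:04:05"))
```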
A Python Identifier. This is a pattern which defines a Python identifier.
This embodies the rule of starting with a letter or _, and containing letters, digits or _’s.
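One way to write that rule is sketched below. The pattern is an assumption based on the stated rule, not necessarily the pattern originally shown, and the sample names are made up.

```python
import re

# Assumed pattern: a letter or _, then zero or more letters, digits or _'s.
identifier_pat = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

names = ["x", "_private", "date_pat2", "2fast", "hi mom"]

# fullmatch() insists that the entire name follows the rule.
valid = [n for n in names if identifier_pat.fullmatch(n)]
print(valid)
```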
The pattern above matches a Python import statement. It matches the beginning of the line with ^; it matches zero or more whitespace characters with \s*; it matches the sequence of letters import; it matches one more whitespace character, and ignores the rest of the line.
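The description above suggests a pattern along the following lines. This is a reconstruction from the description, so the exact pattern is an assumption; the sample lines are made up.

```python
import re

# Assumed pattern: start of line, optional whitespace, "import", one whitespace character.
import_pat = re.compile(r"^\s*import\s")

lines = [
    "import re",
    "    import sys",
    "important = True",        # "import" is not followed by whitespace
    "x = 1  # import later",   # "import" is not at the start of the line
]

hits = [line for line in lines if import_pat.match(line)]
print(hits)
```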
There are several processing steps that we use with regular expressions. As we showed in the processing recipe above, the most common first step is to compile the RE definition string to make a Pattern object. This object can then be used to match or search candidate strings. A successful match returns a Match object with details of the matching substring.
Here’s the formal definition for the re.compile() function of the re package. This translates an RE string into a Pattern object that can be used to search a string or match a string.
Create a Pattern object from an RE string. The object that results is for use in searching or matching; it has several methods, including match() and search().
The following example shows the pattern r"(\d\d):(\d\d)" which should match strings which have two digits, a :, and two digits. We’ll match the candidate string "23:59", which produces a Match object. When we try to match the string "hi mom", we get a result of None.
>>> import re
>>> hhmm_pat= re.compile( r"(\d\d):(\d\d)" )
>>> hhmm_pat.match( "23:59" )
<_sre.SRE_Match object at 0x68d58>
>>> _.groups()
('23', '59')
>>> hhmm_pat.match( "hi mom" )
There are some other options available for re.compile(), see the Python Library Reference, [PythonLib] section 4.2, for more information.
The raw string notation (r"pattern") is generally used to simplify the \’s required. Without the raw notation, each \ in the string would have to be escaped by a \, making it \\. This rapidly gets cumbersome.
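A quick check shows that the two spellings produce the same pattern string. This sketch reuses the hh:mm pattern from the example above.

```python
import re

plain = "(\\d\\d):(\\d\\d)"   # ordinary string: every backslash doubled
raw = r"(\d\d):(\d\d)"        # raw string: each backslash written once

same = plain == raw
groups = re.match(raw, "23:59").groups()
print(same, groups)
```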
Confusing Class Names
As you work through the various examples, you’ll see that the type() function reports the object class names as SRE_Pattern and SRE_Match. We’ve fudged the class names in the book to make the explanation simpler. Also, in the future, there may be other, alternative RE packages, and the class names may be slightly different.
When we say import re, clearly something in the re module is then importing and using a module named _sre.
We don’t need to know much more than this. That’s why the names don’t precisely match what we think they should say based on other, simpler, Python modules.
The following methods are part of a compiled Pattern. Assume that we assigned the pattern to the variable pat, via a statement like pat = re.compile....
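The two Pattern methods used throughout this chapter, match() and search(), differ in where they look. This is a minimal sketch with a made-up pattern and sample line: match() only succeeds at the beginning of the string, while search() scans forward for the first place the pattern fits.

```python
import re

pat = re.compile(r"\d+/\d+/\d+")
line = "Birth Date: 3/8/87"

at_start = pat.match(line)    # None: the line doesn't begin with a digit
anywhere = pat.search(line)   # finds the date further along the line

found = anywhere.group()      # the text that matched
position = anywhere.span()    # (start, end) offsets of the match in the line
print(at_start, found, position)
```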
If search() or match() find the pattern in the candidate string, a Match object is created to describe the match. The following methods are part of a Match object; we’ll use the variable name match.
Retrieve the string that matched a particular () grouping in the regular expression. Group zero is the entire string that matched. Group 1 is the material that matched the first set of ()’s.
If you ask for more than one group, a tuple is returned with the matching string from each of the requested groups.
>>> import re
>>> hhmm_pat= re.compile( r"(\d\d):(\d\d)" )
>>> match = hhmm_pat.match( "23:59" )
>>> match.group(1,2)
('23', '59')
Debugging Regular Expressions
If you forget to import the module, then you get NameError on every class, function or variable reference.
If you spell the name wrong on your import statement, or the module isn’t on your Python Path, you’ll get an ImportError. First, be sure you’ve spelled the module name correctly. If you import sys and then look at sys.path, you can see all the places Python looks for the module. You can look in each of those directories to see how the files are named.
There are two large sources of problems with regular expressions: getting the regular expression wrong and getting the processing wrong.
The regular expression language, with its special characters, escapes, and heavy use of \, is rather difficult to learn. If you get error exceptions from re.compile, then your RE pattern is improper. For example, error: multiple repeat means that your RE is misusing "*" characters. There are a number of these errors which indicate that you are likely missing a \ to escape the special meaning of one or more characters in your pattern.
If you get TypeError errors from match() or search(), then you have not used a candidate string with your pattern. Once you’ve compiled a pattern with pat= re.compile("some pattern"), you use that pattern object with candidate strings: matching= pat.match("candidate"). If you try pat.match(23), 23 isn’t a string and you get a TypeError.
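Both kinds of visible failure can be provoked on purpose, which helps in recognizing them later. In this sketch the faulty pattern "a**" is made up to trigger the multiple repeat error.

```python
import re

# A repeat operator applied directly to another repeat operator.
try:
    re.compile("a**")
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

pat = re.compile(r"\d+")

# 23 is an int, not a string, so match() raises TypeError.
try:
    pat.match(23)
    type_error = None
except TypeError as exc:
    type_error = str(exc)

print(compile_error)
print(type_error)
```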
Beyond these very visible problems is the more subtle problem of a pattern that doesn’t match what you think it should match. We’ll look at this separately, in More Debugging Hints.
In Debugging Regular Expressions we talked a bit about debugging. Beyond those very visible problems is the more subtle problem of a pattern that doesn’t match what you think it should match. It helps to have example strings that are supposed to match, and example strings that are not supposed to match. You can then construct simple test scripts like the following.
import re
pat= re.compile( r"\d+" )
assert pat.match( "2" )
assert pat.match( "1234565.789" )
assert not pat.match( "a" )
If your parsing isn’t working, then a test script like the following helps to debug the patterns so you can see what is matching and being parsed and what is being ignored.
import re
pat= re.compile( r"(\d+):(\d+)" )
m= pat.match( "23:59" )
assert m.groups() == ('23','59')
m= pat.match( "1234565:78.9" )
assert m.groups() == ('1234565','78.9')
assert not pat.match( "a" )
In this last example, you’ll note that our pattern matched digits, but our test data included a .. Either our test is wrong, or our pattern is wrong. This is the art of debugging: what was really supposed to happen? Did it happen?
In this case, we’ll have to rewrite the pattern to get the test to pass.
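One possible rewrite is sketched here, assuming the test data is right and the second number really may carry a decimal fraction. If the test data were wrong instead, the fix would be to change the test, not the pattern.

```python
import re

# The second group now allows an optional decimal point and more digits.
pat = re.compile(r"(\d+):(\d+\.?\d*)")

m = pat.match("23:59")
assert m.groups() == ("23", "59")

m = pat.match("1234565:78.9")
assert m.groups() == ("1234565", "78.9")

assert not pat.match("a")
```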
Unit Test Framework. This way of testing our patterns is so important, we sometimes create separate modules just for proving that our patterns work. The example shown above with assert statements is just the tip of the iceberg.
The Python unittest module provides a way to create special test modules that exist simply to prove that our software really works as intended.
This is beyond the scope of this book, so we’ll stick with simple scripts that use the assert statement.
In Putting Generators To Use we looked at a fairly complex set of string manipulations, done the hard way. These can be redone as regular expressions, leading to a dramatic improvement of this example. In the example, we looked for log entries based on the first four characters being the year, "2003" in that example. We can now improve that example to use a regular expression to examine each line.
Here’s a snippet of a log file that we want to analyze. Note that it has some lines with dates, and other lines with junk that we want to skip.
log= """
2003-07-28 12:46:42,843 INFO [main]  - -------------------------------------------------------------------
XYZ Management Console initialized at: Mon Jul 28 12:46:42 EDT 2003
Package Build: 452
-------------------------------------------------------------------
2003-07-28 12:46:50,109 INFO [main]  - Export directory does not exist
2003-07-28 12:46:50,109 INFO [main]  - Export directory created successfully
2003-07-28 12:46:50,125 INFO [main]  - Starting Coyote HTTP/1.1 on port 9842
2003-07-28 12:57:14,046 INFO [Thread-11]  - request.getRequestURI =...
2003-07-28 12:57:18,875 INFO [Thread-11] [admin] - Logged in
2003-07-28 12:57:19,625 INFO [Thread-11]  - request.getRequestURI =...
"""
This sequence decodes a complex input value into individual fields and then computes a single result.
>>> import re
>>> datePat= re.compile("(\d\d\d\d)-(\d\d)-(\d\d)")
>>> logLine = "2003-07-28 12:46:50,109 INFO [main]  - Export directory does not exist"
>>> dateMatch= datePat.match( logLine )
>>> dateMatch.group( 0, 1, 2, 3 )
('2003-07-28', '2003', '07', '28')
>>> y,m,d= map( int, dateMatch.group(1,2,3) )
>>> import datetime
>>> lineDate= datetime.date( y, m, d )
>>> lineDate
datetime.date(2003, 7, 28)
The first import statement incorporates the re module.
The datePat variable is the compiled Pattern object which matches three numbers, using (\d\d\d\d) or (\d\d), separated by -’s. This matches the log date stamp very precisely: a four-digit number, followed by two two-digit numbers.
The digit-sequence RE’s are surrounded by ()’s so that the material that matches is returned as a group. A Match object will have four groups: group 0 is everything that matched, groups 1, 2, and 3 are successive digit strings.
The logLine variable is sample input, read from our log file. Typically, this will be one line of input read inside a for loop.
The dateMatch variable is a Match object that indicates success or failure in matching. If dateMatch is None, no match occurred. Otherwise, the dateMatch.group() method will reveal the individually matched input items.
dateMatch.group shows the various groups that are available in the Match object. Group 0 is the entire match. Groups 1, 2 and 3 are the various elements of the date.
Setting y, m, and d involves a number of steps. First we use dateMatch.group() to create a tuple of requested items. Each item in the tuple will be a string. Second, the map() function is used to apply the built-in int() function against each string to create a sequence of three numbers. Finally, this statement relies on the multiple-assignment feature to set all three variables at once.
Finally, lineDate is computed as a datetime.date object with the given year, month and day values.
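The same steps can be wrapped in a loop over every line of the log, skipping the junk lines. This is a sketch under a few assumptions: the function name line_dates is hypothetical, and the sample text is a shortened copy of the log shown above.

```python
import re
import datetime

# Shortened sample: two dated lines and two junk lines.
log = """2003-07-28 12:46:50,109 INFO [main]  - Export directory does not exist
XYZ Management Console initialized at: Mon Jul 28 12:46:42 EDT 2003
Package Build: 452
2003-07-28 12:57:18,875 INFO [Thread-11] [admin] - Logged in
"""

date_pat = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")

def line_dates(text):
    """Yield (date, line) for each line that starts with a date stamp."""
    for line in text.splitlines():
        m = date_pat.match(line)
        if m is None:
            continue   # junk line without a leading date
        y, mo, d = map(int, m.group(1, 2, 3))
        yield datetime.date(y, mo, d), line

dated = list(line_dates(log))
for when, line in dated:
    print(when, line)
```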
Extend the Log Processing.
Extend the example pattern for analyzing log records. In the example above, it matches just the date; expand it to match date and time. Change the result to be a datetime.datetime object.
You can revisit the example in Putting Generators To Use and do more sophisticated date and time processing on the log entries. This is because you can now compare the log entry to a start or stop time. You can also compute the time between log entries.
Parse Stock prices.
Create a function that will decode the old-style fractional stock price. The price can be a simple floating-point number or it can be a fraction, for example, 4 5/8.
Develop two patterns, one for numbers with optional decimal places and another for a number with a space and a fraction. Write a function that accepts a string and checks both patterns, returning the correct decimal price for whole numbers (e.g., 14), decimal prices (e.g., 5.28) and fractional prices (27 1/4).
Create a function that will decode a few common American date formats. For example, 3/18/87 is March 18, 1987. You might want to do 18-Mar-87 as an alternative format. Stick to two or three common formats; otherwise, this can become quite complex.
Develop the required patterns for the candidate date formats. Write a function that accepts a string and checks each of your patterns, looking for the first one that works. It will return the date as a tuple of ( year, month, day ).
In some earlier exercises (Class Definition Exercises) we glossed over the date processing to evaluate our stock portfolio. You can use this to add the necessary date parsing.
You can read about this in Wikipedia. The mathematics behind regular expressions are based on Kleene’s theories of regular sets. The name regular expressions comes from the expressions that describe the regular sets. The sets contain all of the strings matched by the expressions.
While the name isn’t descriptive, we’re stuck with it. Worse, the RE package has features that are not part of Kleene’s original mathematics, making it do more than the formal definition of regular expressions.
Your perception is correct, that the RE module doesn’t do anything new. While RE’s can be hard to learn, the time invested pays handsome dividends. Once you get the hang of writing RE’s, your programs are simpler than the equivalent program done with string methods.
In some cases, however, the string methods ( split(), specifically) are simpler than regular expressions. The decision is based on what gives you a simpler, more reliable, more readable program.