Chapter 31. Complex Strings: the re Module

There are a number of related problems when processing strings. When we get strings as input from files, we need to recognize the input as meaningful. Once we're sure it's in the right form, we need to parse the input. Sometimes we'll have to convert some parts into numbers (or other objects) for further use.

For example, a file may contain lines which are supposed to be like "Birth Date: 3/8/85". We may need to determine if a given string has the right form. Then, we may need to break the string into individual elements for date processing.
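As a minimal sketch of these three steps (recognize, parse, convert), the following uses the re module on the sample line. The pattern, the function name parse_birth_date, and the choice to return the three numbers as a plain tuple are assumptions made for illustration; the text does not say which field is the month and which is the day.

```python
import re

# Hypothetical pattern for lines such as "Birth Date: 3/8/85":
# three groups of digits separated by slashes.
BIRTH_DATE = re.compile(r"Birth Date: (\d{1,2})/(\d{1,2})/(\d{2})")

def parse_birth_date(line):
    """Return the three date fields as integers, or None if the line has the wrong form."""
    # Recognize: does the whole line have the expected form?
    match = BIRTH_DATE.fullmatch(line.strip())
    if match is None:
        return None
    # Parse and convert: pull out each group and turn it into a number.
    return tuple(int(text) for text in match.groups())

print(parse_birth_date("Birth Date: 3/8/85"))    # (3, 8, 85)
print(parse_birth_date("Birth Date: March 8"))   # None
```

Returning None for a line of the wrong form keeps recognition and parsing in one place; a caller that prefers an exception could raise ValueError instead.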

We can accomplish these recognition, parsing and conversion operations with the re module in Python. A regular expression (RE) is a rule or pattern used for matching strings. It differs from the fairly simple “wild-card” rules used by many operating systems for naming files with a pattern. Those file-name matching rules are embodied in two simpler modules: fnmatch and glob.
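To make the difference concrete, here is a small sketch that applies a wild-card rule and a regular expression to the same list of names. The file names and both patterns are invented for this illustration.

```python
import fnmatch
import re

# Hypothetical file names, used only for this comparison.
names = ["data.txt", "data1.txt", "data22.txt", "notes.txt"]

# Wild-card rule: "?" stands for exactly one character of any kind.
wild = [n for n in names if fnmatch.fnmatchcase(n, "data?.txt")]

# Regular expression: "\d+" stands for one or more digits,
# and "\." stands for a literal period.
pattern = re.compile(r"data\d+\.txt")
regex = [n for n in names if pattern.fullmatch(n)]

print(wild)    # ['data1.txt']
print(regex)   # ['data1.txt', 'data22.txt']
```

The wild-card rule can only say "any one character here", while the regular expression can say "one or more digits here", which is the kind of extra precision the re module provides.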

We'll look at the semantics of a regular expression in the section called “Semantics”. We'll look at the syntax for defining an RE in the section called “Creating a Regular Expression”. In the section called “Using a Regular Expression” we'll put the regular expression to use.

Semantics

One way to look at regular expressions is as a production rule for constructing strings. In principle, such a rule could describe an infinite number of strings. The real purpose is not to enumerate all of the strings described by the production rule, but to match a candidate string against the production rule to see if the rule could have constructed the given string.

For example, a rule could be "aba". This rule produces only a single string, "aba", so that is the only string that matches it. Determining a match between a given string and the one string produced by this rule is pretty simple.

A more complex rule could be "ab*a". The b* means zero or more copies of b. This rule produces an infinite set of strings including "aa", "aba", "abba", etc. It's a little more complex to see if a given string could have been produced by this rule.
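The rule "ab*a" can be tried directly against a few candidate strings. This is only a sketch: the list of candidates is made up, and fullmatch is used so that the entire candidate, not just part of it, must be a string the rule could have produced.

```python
import re

rule = re.compile(r"ab*a")

# Illustrative candidates: three the rule can produce, two it cannot.
candidates = ["aa", "aba", "abba", "abca", "ab"]

# fullmatch returns a match object on success and None on failure.
results = {text: rule.fullmatch(text) is not None for text in candidates}

for text, matched in results.items():
    print(text, matched)
# aa True
# aba True
# abba True
# abca False
# ab False
```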

The Python re module includes Python constructs for creating regular expressions (REs), matching candidate strings against REs, and examining the details of the substrings that match. There is a lot of power and subtlety to this module. A complete treatment is beyond the scope of this book.
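These three activities can be sketched in a few lines. The sample string "xxabbbayy" is invented for the illustration; compile, search, group and span are standard parts of the re module.

```python
import re

# Create: compile the rule into a pattern object.
rule = re.compile(r"ab*a")

# Match: search looks for the rule anywhere inside the candidate string.
match = rule.search("xxabbbayy")

# Examine: a match object reports what matched and where.
if match is not None:
    print(match.group())   # abbba
    print(match.span())    # (2, 7)
```

When nothing matches, search returns None, so testing the result before using it is the usual habit.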