Creating a Regular Expression

There are a lot of options and clauses that can be used to create regular expressions. We can't pretend to cover them all in a single chapter. Instead, we'll cover the basics of creating and using RE's. The full set of rules is given in section 4.2.1 Regular Expression Syntax of the Python Library Reference document. Additionally, there are many fine books devoted to this subject.

  • Any ordinary character, by itself, is an RE. Example: "a" is an RE that matches the character a in the candidate string. While trivial, it is critical to know that each ordinary character is a stand-alone RE.

    Some characters have special meanings. We can escape that special meaning by using a \ in front of them. For example, * is a special character, but \* escapes the special meaning and creates a single-character RE that matches the character *.

    Additionally, some ordinary characters can be made special with \. For instance \d is any digit, \s is any whitespace character. \D is any non-digit, \S is any non-whitespace character.

  • The character . is an RE that matches any single character (in Python, any character except a newline, by default). Example: "x.z" is an RE that matches strings like xaz or xbz, but doesn't match strings like xabz.

  • The brackets, "[...]", create an RE that matches any single character listed between the [ ]'s. Example: "x[abc]z" matches any of xaz, xbz or xcz. A range of characters can be specified using a -, for example "x[1-9]z". To include a literal - in the set, it must be first or last. To include a literal ^, it must not be first. Multiple ranges are allowed, for example "x[A-Za-z]z". Here's a common RE that matches a letter followed by a letter, digit or _: "[A-Za-z][A-Za-z0-9_]"

  • The modified brackets, "[^...]", create a regular expression that matches any single character except those between the [ ]'s. Example: "a[^xyz]b" matches strings like a9b and a$b, but doesn't match axb. As with [ ], a range can be specified and multiple ranges can be specified.

  • A regular expression can be formed by concatenating regular expressions. Example: "a.b" is three regular expressions: the first matches a, the second matches any character, and the third matches b.

  • A regular expression can be a group of regular expressions, formed with ()'s. Example: "(ab)c" is a regular expression composed of two regular expressions: "(ab)" (which, in turn, is composed of two RE's) and "c". ()'s also group RE's for extraction purposes. The elements matched within ()'s are remembered by the regular expression processor and set aside in a match object.

  • A regular expression can be repeated. Several repeat constructs are available: "x*" repeats "x" zero or more times; "x+" repeats "x" one or more times; "x?" repeats "x" zero times or one time. Example: "1(abc)*2" matches 12 or 1abc2 or 1abcabc2, etc. The first match, against 12, is often surprising; but there are zero copies of abc between 1 and 2.

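  • The character "^" is an RE that only matches the beginning of the line, and "$" is an RE that only matches the end of the line. Example: "^$" matches a completely empty line. In Python's re module, these apply to the string as a whole by default; the re.MULTILINE flag makes them apply to each line within a multi-line string.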
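
The rules above can be tried out directly with Python's re module. The following is a minimal sketch rather than a complete tour; fully_matches is a hypothetical helper name introduced here for illustration, and it relies on re.fullmatch so that the whole candidate string must match.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# An escaped special character matches itself literally.
print(fully_matches(r"\*", "*"))               # True
# . stands for exactly one character.
print(fully_matches(r"x.z", "xaz"))            # True
print(fully_matches(r"x.z", "xabz"))           # False
# A bracketed set, and its negated form.
print(fully_matches(r"x[abc]z", "xbz"))        # True
print(fully_matches(r"a[^xyz]b", "a$b"))       # True
print(fully_matches(r"a[^xyz]b", "axb"))       # False
# * permits zero copies of the group.
print(fully_matches(r"1(abc)*2", "12"))        # True
print(fully_matches(r"1(abc)*2", "1abcabc2"))  # True
# ^$ matches only an empty line.
print(fully_matches(r"^$", ""))                # True

# Parentheses remember what they matched in the match object.
match = re.fullmatch(r"(ab)c", "abc")
print(match.group(1))                          # ab
```

The patterns are written as raw strings (r"...") so that Python passes each \ through to the RE processor unchanged.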

Here are some examples.

"[_A-Za-z][_A-Za-z0-9]*"

Matches a Python identifier. This embodies the rule of starting with a letter or _, and containing any number of letters, digits or _'s. Note that any number includes 0 occurrences, so a single letter or _ is a valid identifier.
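
As a rough sketch of how this pattern might be used, the following checks whole strings against it. The digit range is written 0-9 so that names containing a 0 are accepted; the names IDENTIFIER and is_identifier are hypothetical, chosen only for this illustration. The pattern covers ASCII names only.

```python
import re

# Letter or underscore first, then any number of letters, digits or underscores.
IDENTIFIER = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

def is_identifier(text):
    """Hypothetical helper: True if the whole text is an ASCII identifier."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_identifier("x"))        # True
print(is_identifier("_"))        # True
print(is_identifier("total_0"))  # True
print(is_identifier("2nd"))      # False
```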

"^\s*import\s"

Matches a simple import statement. It matches the beginning of the line with ^, zero or more whitespace characters with \s*, the sequence of letters import, and one more whitespace character. This pattern will ignore the rest of the line.

"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s"

Matches a from module import statement. As with the simple import, it matches the beginning of the line (^), zero or more whitespace characters (\s*), the sequence of letters from, one or more whitespace characters (\s+), a Python module name, one or more whitespace characters (\s+), the sequence import, and one more whitespace character.
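
The two import patterns can be applied to individual source lines. This is a minimal sketch; classify and the two constant names are hypothetical, and the module-name part uses the digit range 0-9. The last call shows a limit of the pattern as written: a dotted module name is not matched, because . is not part of the identifier set.

```python
import re

SIMPLE_IMPORT = re.compile(r"^\s*import\s")
FROM_IMPORT = re.compile(r"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s")

def classify(line):
    """Hypothetical helper: label a source line by the pattern it matches."""
    if SIMPLE_IMPORT.match(line):
        return "import"
    if FROM_IMPORT.match(line):
        return "from-import"
    return None

print(classify("import os"))                 # import
print(classify("    import sys, re"))        # import
print(classify("from math import sqrt"))     # from-import
print(classify("important = 1"))             # None
print(classify("from os.path import join"))  # None
```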

"(\d+):(\d+):(\d+\.?\d*)"

Matches one or more digits, a :, one or more digits, a :, and digits followed by an optional . and zero or more other digits. For example, 20:07:13.2 would match, as would 13:04:05. Further, the ()'s allow separating the digit strings for conversion and further processing.
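
To illustrate the extraction step, here is a small sketch that pulls the three groups out of the match object and converts them to numbers. The function name parse_time is hypothetical, and treating the fields as hours, minutes and seconds is an assumption made for this example.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+\.?\d*)")

def parse_time(text):
    """Hypothetical helper: return (hours, minutes, seconds) or None."""
    match = TIME.match(text)
    if match is None:
        return None
    # groups() returns the text captured by each pair of ()'s, in order.
    hours, minutes, seconds = match.groups()
    return int(hours), int(minutes), float(seconds)

print(parse_time("20:07:13.2"))  # (20, 7, 13.2)
print(parse_time("13:04:05"))    # (13, 4, 5.0)
print(parse_time("noon"))        # None
```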

"def\s+([_A-Za-z][_A-Za-z1-9]*)\s+\([^)]*\):"

Matches Python function definition lines. It matches the letters def; a string of 1 or more whitespace characters (\s); an identifier, surrounded by ()'s to capture the entire identifier as a match. It matches a (; we've used \( to escape the meaning of ( and make it an ordinary character. It matches a string of non-) characters, which would be the parameter list. The parameter list ends with a ); we've used \) to make escape the meaning of ) and make it an ordinary character. Finally, we need tyo see the :.
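
A sketch of using a pattern like this to pull function names out of source lines follows. It assumes a variant that allows optional whitespace (\s*) between the name and the (, and uses the digit range 0-9; function_name and DEF_LINE are hypothetical names. The pattern is deliberately simple: a definition with a return annotation, or with a ) inside a default value, would not match.

```python
import re

DEF_LINE = re.compile(r"def\s+([_A-Za-z][_A-Za-z0-9]*)\s*\([^)]*\):")

def function_name(line):
    """Hypothetical helper: return the defined function's name, or None."""
    # search() finds the pattern anywhere in the line, so indentation is fine.
    match = DEF_LINE.search(line)
    return match.group(1) if match else None

print(function_name("def area(width, height):"))  # area
print(function_name("    def __init__(self):"))   # __init__
print(function_name("x = 1"))                     # None
```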