Creating a Regular Expression

There are a lot of options and clauses that can be used to create regular expressions. We can't pretend to cover them all in a single chapter. Instead, we'll cover the basics of creating and using RE's. The full set of rules is given in section 4.2.1 Regular Expression Syntax of the Python Library Reference document. Additionally, there are many fine books devoted to this subject.

  • Any ordinary character, by itself, is an RE. Example: "a" is an RE that matches the character a in the candidate string. While trivial, it is critical to know that each ordinary character is a stand-alone RE.

    Some characters have special meanings. We can escape that special meaning by using a \ in front of them. For example, * is a special character, but \* escapes the special meaning and creates a single-character RE that matches the character *.

    Additionally, some ordinary characters can be made special with \. For instance \d is any digit, \s is any whitespace character. \D is any non-digit, \S is any non-whitespace character.

  • The character . is an RE that matches any single character. Example: "x.z" is an RE that matches strings like xaz or xbz, but doesn't match strings like xabz.

  • The brackets, "[...]", create an RE that matches any single character between the [ ]'s. Example: "x[abc]z" matches any of xaz, xbz or xcz. A range of characters can be specified using a -, for example "x[1-9]z". To include a literal -, it must be first or last. To include a literal ^, it must not be first. Multiple ranges are allowed, for example "x[A-Za-z]z". Here's a common RE that matches a letter followed by a letter, digit or _: "[A-Za-z][A-Za-z0-9_]"

  • The modified brackets, "[^...]", create a regular expression that matches any single character except those between the [ ]'s. Example: "a[^xyz]b" matches strings like a9b and a$b, but doesn't match axb. As with [ ], a range can be specified and multiple ranges can be specified.

  • A regular expression can be formed from concatenating regular expressions. Example: "a.b" is three regular expressions, the first matches a, the second matches any character, the third matches b.

  • A regular expression can be a group of regular expressions, formed with ()'s. Example: "(ab)c" is a regular expression composed of two regular expressions: "(ab)" (which, in turn, is composed of two RE's) and "c". ()'s also group RE's for extraction purposes. The elements matched within ()'s are remembered by the regular expression processor and set aside in a match object.

  • A regular expression can be repeated. Several repeat constructs are available: "x*" repeats "x" zero or more times; "x+" repeats "x" one or more times; "x?" repeats "x" zero times or one time. Example: "1(abc)*2" matches 12 or 1abc2 or 1abcabc2, etc. The first match, against 12, is often surprising; but there are zero copies of abc between 1 and 2.

  • The character "^" is an RE that only matches at the beginning of the line, "$" is an RE that only matches at the end of the line. In Python, the "line" is the whole candidate string unless the re.MULTILINE flag is given. Example: "^$" matches a completely empty line.
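The rules above can be tried out with Python's re module. The following is a minimal sketch, not a complete test of the syntax: the sample strings are made up for illustration, and re.fullmatch is used so that the whole candidate string has to match the pattern.

```python
import re

# Each entry: (pattern, candidate string, whether the whole string should match)
checks = [
    (r"a", "a", True),              # an ordinary character matches itself
    (r"\*", "*", True),             # \* matches a literal *
    (r"\d\s\D\S", "7 x!", True),    # digit, whitespace, non-digit, non-whitespace
    (r"x.z", "xaz", True),          # . matches any single character
    (r"x.z", "xabz", False),        # ... but only one character
    (r"x[abc]z", "xbz", True),      # one character from the set
    (r"x[A-Za-z]z", "x9z", False),  # 9 is in neither range
    (r"a[^xyz]b", "a$b", True),     # any character except x, y or z
    (r"a[^xyz]b", "axb", False),
    (r"1(abc)*2", "12", True),      # zero copies of abc
    (r"1(abc)*2", "1abcabc2", True),
    (r"^$", "", True),              # an empty string
]

results = [bool(re.fullmatch(p, s)) == expected for p, s, expected in checks]
print(all(results))

# ()'s also remember what they matched; group(1) is the first group.
grouped = re.fullmatch(r"(ab)c", "abc")
print(grouped.group(1))
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the \ characters before the re module sees them.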

Here are some examples.

"[_A-Za-z][_A-Za-z0-9]*"

Matches a Python identifier. This embodies the rule of starting with a letter or _, and containing any number of letters, digits or _'s. Note that any number includes 0 occurrences, so a single letter or _ is a valid identifier.
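As a quick check of the identifier pattern, here is a small sketch. The helper name is_identifier is made up for this illustration, and the digit range is written 0-9 so that the digit 0 is accepted. Note that this pattern only covers ASCII identifiers and does not exclude Python keywords.

```python
import re

# Identifier pattern: a letter or _, then any number of letters, digits or _.
IDENTIFIER = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

def is_identifier(text):
    """Return True when the whole of text is matched by the pattern."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_identifier("x"))        # a single letter is enough
print(is_identifier("_count0"))  # _, letters and digits after the first character
print(is_identifier("2fast"))    # cannot start with a digit
```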

"^\s*import\s"

Matches a simple import statement. It matches the beginning of the line with ^, zero or more whitespace characters with \s*, the sequence of letters import, and one more whitespace character. This pattern will ignore the rest of the line.

"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s"

Matches a from module import statement. As with the simple import, it matches the beginning of the line (^), zero or more whitespace characters (\s*), the sequence of letters from, one or more whitespace characters (\s+), a Python module name, one or more whitespace characters (\s+), the sequence import, and one more whitespace character.
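The two import patterns can be compared side by side. This is a sketch under a few assumptions: the function classify and the sample lines are hypothetical, and the module-name part is written with 0-9 so that the digit 0 is accepted.

```python
import re

SIMPLE_IMPORT = re.compile(r"^\s*import\s")
FROM_IMPORT = re.compile(r"^\s*from\s+[_A-Za-z][_A-Za-z0-9]*\s+import\s")

def classify(line):
    """Label a source line as 'import', 'from' or 'other'."""
    if SIMPLE_IMPORT.match(line):
        return "import"
    if FROM_IMPORT.match(line):
        return "from"
    return "other"

# Hypothetical sample lines, made up for this illustration.
samples = [
    "import os",
    "    import sys, re",
    "important = 1",
    "from math import sqrt",
    "  from  re   import compile",
    "from os.path import join",
]
labels = [classify(line) for line in samples]
print(labels)
```

The line important = 1 is rejected because the \s after import requires whitespace. The last sample is also labelled other: the module-name part of the pattern has no ., so dotted names such as os.path would need a richer pattern.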

"(\d+):(\d+):(\d+\.?\d*)"

Matches one or more digits, a :, one or more digits, a :, and digits followed by an optional . and zero or more other digits. For example 20:07:13.2 would match, as would 13:04:05. Further, the ()'s would allow separating the digit strings for conversion and further processing.
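The conversion step mentioned above might look like the following sketch. The function name parse_time is hypothetical; it relies only on the three groups that the ()'s in the pattern set aside.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+\.?\d*)")

def parse_time(text):
    """Return (hours, minutes, seconds) as numbers, or None if there is no match."""
    match = TIME.match(text)
    if match is None:
        return None
    # groups() returns the three strings captured by the ()'s, in order.
    hours, minutes, seconds = match.groups()
    return int(hours), int(minutes), float(seconds)

print(parse_time("20:07:13.2"))
print(parse_time("13:04:05"))
print(parse_time("noon"))
```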

"def\s+([_A-Za-z][_A-Za-z0-9]*)\s*\([^)]*\):"

Matches Python function definition lines. It matches the letters def; a string of one or more whitespace characters (\s+); an identifier, surrounded by ()'s to capture the entire identifier as a match; and zero or more whitespace characters (\s*), since the ( usually follows the name directly. It matches a (; we've used \( to escape the meaning of ( and make it an ordinary character. It matches a string of non-) characters, which would be the parameter list. The parameter list ends with a ); we've used \) to escape the meaning of ) and make it an ordinary character. Finally, we need to see the :.
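Here is a sketch of pulling the function name out of a definition line. The helper function_name is made up for this illustration. This version of the pattern is an assumption on two points: it uses \s* between the name and the ( so that the usual def name(args): spelling matches, and it writes the digit range as 0-9.

```python
import re

DEF_LINE = re.compile(r"def\s+([_A-Za-z][_A-Za-z0-9]*)\s*\([^)]*\):")

def function_name(line):
    """Return the name captured by the first group, or None if no def is found."""
    match = DEF_LINE.search(line)  # search() allows leading indentation
    return match.group(1) if match else None

print(function_name("def parse_time(text):"))
print(function_name("    def __init__(self, x=0):"))
print(function_name("define = 3"))
```

Because the parameter list is matched with [^)]*, a default value that itself contains a ), such as x=f(), would end the match early; a full parser is a better tool for such lines.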


 
 
Published under the terms of the Open Publication License