Follow Techotopia on Twitter

On-line Guides
All Guides
eBook Store
iOS / Android
Linux for Beginners
Office Productivity
Linux Installation
Linux Security
Linux Utilities
Linux Virtualization
Linux Kernel
System/Network Admin
Programming
Scripting Languages
Development Tools
Web Development
GUI Toolkits/Desktop
Databases
Mail Systems
openSolaris
Eclipse Documentation
Techotopia.com
Virtuatopia.com
Answertopia.com

How To Guides
Virtualization
General System Admin
Linux Security
Linux Filesystems
Web Servers
Graphics & Desktop
PC Hardware
Windows
Problem Solutions
Privacy Policy

  




 

 

Thinking in Java
Prev Contents / Index Next

Pattern and Matcher

As a first example, the following class can be used to test regular expressions against an input string. The first argument is the input string to match against, followed by one or more regular expressions to be applied to the input. Under Unix/Linux, the regular expressions must be quoted on the command line.

This program can be useful in testing regular expressions as you construct them to see that they produce your intended matching behavior.

//: c12:TestRegularExpression.java
// Allows you to easly try out regular expressions.
// {Args: abcabcabcdefabc "abc+" "(abc)+" "(abc){2,}" }
import java.util.regex.*;

public class TestRegularExpression {
  public static void main(String[] args) {
    if(args.length < 2) {
      System.out.println("Usage:\n" +
        "java TestRegularExpression " +
        "characterSequence regularExpression+");
      System.exit(0);
    }
    System.out.println("Input: \"" + args[0] + "\"");
    for(int i = 1; i < args.length; i++) {
      System.out.println(
        "Regular expression: \"" + args[i] + "\"");
      Pattern p = Pattern.compile(args[i]);
      Matcher m = p.matcher(args[0]);
      while(m.find()) {
        System.out.println("Match \"" + m.group() +
          "\" at positions " +
          m.start() + "-" + (m.end() - 1));
      }
    }
  }
} ///:~


Regular expressions are implemented in Java through the Pattern and Matcher classes in the package java.util.regex. A Pattern object represents a compiled version of a regular expression. The static compile( ) method compiles a regular expression string into a Pattern object. As seen in the preceding example, you can use the matcher( ) method and the input string to produce a Matcher object from the compiled Pattern object. Pattern also has a

static boolean ( regex,  input)


for quickly discerning if regex can be found in input, and a split( ) method that produces an array of String that has been broken around matches of the regex.

A Matcher object is generated by calling Pattern.matcher( ) with the input string as an argument. The Matcher object is then used to access the results, using methods to evaluate the success or failure of different types of matches:

boolean matches()
boolean lookingAt()
boolean find()
boolean find(int start)


The matches( ) method is successful if the pattern matches the entire input string, while lookingAt( ) is successful if the input string, starting at the beginning, is a match to the pattern.

find( )

Matcher.find( ) can be used to discover multiple pattern matches in the CharSequence to which it is applied. For example:

//: c12:FindDemo.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*;
import java.util.*;

public class FindDemo {
  private static Test monitor = new Test();
  public static void main(String[] args) {
    Matcher m = Pattern.compile("\\w+")
      .matcher("Evening is full of the linnet's wings");
    while(m.find())
      System.out.println(m.group());
    int i = 0;
    while(m.find(i)) {
      System.out.print(m.group() + " ");
      i++;
    }
    monitor.expect(new String[] {
      "Evening",
      "is",
      "full",
      "of",
      "the",
      "linnet",
      "s",
      "wings",
      "Evening vening ening ning ing ng g is is s full " +
      "full ull ll l of of f the the he e linnet linnet " +
      "innet nnet net et t s s wings wings ings ngs gs s "
    });
  }
} ///:~


The pattern “\\w+” indicates “one or more word characters,” so it will simply split up the input into words. find( ) is like an iterator, moving forward through the input string. However, the second version of find( ) can be given an integer argument that tells it the character position for the beginning of the search—this version resets the search position to the value of the argument, as you can see from the output.

Groups

Groups are regular expressions set off by parentheses that can be called up later with their group number. Group zero indicates the whole expression match, group one is the first parenthesized group, etc. Thus in

A(B(C))D


there are three groups: Group 0 is ABCD, group 1 is BC, and group 2 is C.

The Matcher object has methods to give you information about groups:

public int groupCount( ) returns the number of groups in this matcher's pattern. Group zero is not included in this count.

public String group( ) returns group zero (the entire match) from the previous match operation (find( ), for example).

public String group(int i) returns the given group number during the previous match operation. If the match was successful, but the group specified failed to match any part of the input string, then null is returned.

public int start(int group) returns the start index of the group found in the previous match operation.

public int end(int group) returns the index of the last character, plus one, of the group found in the previous match operation.

Here’s an example of regular expression groups:

//: c12:Groups.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*;

public class Groups {
  private static Test monitor = new Test();
  static public final String poem =
    "Twas brillig, and the slithy toves\n" +
    "Did gyre and gimble in the wabe.\n" +
    "All mimsy were the borogoves,\n" +
    "And the mome raths outgrabe.\n\n" +
    "Beware the Jabberwock, my son,\n" +
    "The jaws that bite, the claws that catch.\n" +
    "Beware the Jubjub bird, and shun\n" +
    "The frumious Bandersnatch.";
  public static void main(String[] args) {
    Matcher m =
      Pattern.compile("(?m)(\\S+)\\s+((\\S+)\\s+(\\S+))$")
        .matcher(poem);
    while(m.find()) {
      for(int j = 0; j <= m.groupCount(); j++)
        System.out.print("[" + m.group(j) + "]");
      System.out.println();
    }
    monitor.expect(new String[]{
      "[the slithy toves]" +
      "[the][slithy toves][slithy][toves]",
      "[in the wabe.][in][the wabe.][the][wabe.]",
      "[were the borogoves,]" +
      "[were][the borogoves,][the][borogoves,]",
      "[mome raths outgrabe.]" +
      "[mome][raths outgrabe.][raths][outgrabe.]",
      "[Jabberwock, my son,]" +
      "[Jabberwock,][my son,][my][son,]",
      "[claws that catch.]" +
      "[claws][that catch.][that][catch.]",
      "[bird, and shun][bird,][and shun][and][shun]",
      "[The frumious Bandersnatch.][The]" +
      "[frumious Bandersnatch.][frumious][Bandersnatch.]"
    });
  }
} ///:~


The poem is the first part of Lewis Carroll’s “Jabberwocky,” from Through the Looking Glass. You can see that the regular expression pattern has a number of parenthesized groups, consisting of any number of non-whitespace characters (‘\S+’) followed by any number of whitespace characters (‘\s+’). The goal is to capture the last three words on each line; the end of a line is delimited by ‘$’. However, the normal behavior is to match ‘$’ with the end of the entire input sequence, so we must explicitly tell the regular expression to pay attention to newlines within the input. This is accomplished with the ‘(?m)’ pattern flag at the beginning of the sequence (pattern flags will be shown shortly).

start( ) and end( )

Following a successful matching operation, start( ) returns the start index of the previous match, and end( ) returns the index of the last character matched, plus one. Invoking either start( ) or end( ) following an unsuccessful matching operation (or prior to a matching operation being attempted) produces an IllegalStateException. The following program also demonstrates matches( ) and lookingAt( ):

//: c12:StartEnd.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*;

public class StartEnd {
  private static Test monitor = new Test();
  public static void main(String[] args) {
    String[] input = new String[] {
      "Java has regular expressions in 1.4",
      "regular expressions now expressing in Java",
      "Java represses oracular expressions"
    };
    Pattern
      p1 = Pattern.compile("re\\w*"),
      p2 = Pattern.compile("Java.*");
    for(int i = 0; i < input.length; i++) {
      System.out.println("input " + i + ": " + input[i]);
      Matcher
        m1 = p1.matcher(input[i]),
        m2 = p2.matcher(input[i]);
      while(m1.find())
        System.out.println("m1.find() '" + m1.group() +
          "' start = "+ m1.start() + " end = " + m1.end());
      while(m2.find())
        System.out.println("m2.find() '" + m2.group() +
          "' start = "+ m2.start() + " end = " + m2.end());
      if(m1.lookingAt()) // No reset() necessary
        System.out.println("m1.lookingAt() start = "
          + m1.start() + " end = " + m1.end());
      if(m2.lookingAt())
        System.out.println("m2.lookingAt() start = "
          + m2.start() + " end = " + m2.end());
      if(m1.matches()) // No reset() necessary
        System.out.println("m1.matches() start = "
          + m1.start() + " end = " + m1.end());
      if(m2.matches())
        System.out.println("m2.matches() start = "
          + m2.start() + " end = " + m2.end());
    }
    monitor.expect(new String[] {
      "input 0: Java has regular expressions in 1.4",
      "m1.find() 'regular' start = 9 end = 16",
      "m1.find() 'ressions' start = 20 end = 28",
      "m2.find() 'Java has regular expressions in 1.4'" +
      " start = 0 end = 35",
      "m2.lookingAt() start = 0 end = 35",
      "m2.matches() start = 0 end = 35",
      "input 1: regular expressions now " +
      "expressing in Java",
      "m1.find() 'regular' start = 0 end = 7",
      "m1.find() 'ressions' start = 11 end = 19",
      "m1.find() 'ressing' start = 27 end = 34",
      "m2.find() 'Java' start = 38 end = 42",
      "m1.lookingAt() start = 0 end = 7",
      "input 2: Java represses oracular expressions",
      "m1.find() 'represses' start = 5 end = 14",
      "m1.find() 'ressions' start = 27 end = 35",
      "m2.find() 'Java represses oracular expressions' " +
      "start = 0 end = 35",
      "m2.lookingAt() start = 0 end = 35",
      "m2.matches() start = 0 end = 35"
    });
  }
} ///:~


Notice that find( ) will locate the regular expression anywhere in the input, but lookingAt( ) and matches( ) only succeed if the regular expression starts matching at the very beginning of the input. While matches( ) only succeeds if the entire input matches the regular expression, lookingAt( )[67] succeeds if only the first part of the input matches.

Pattern flags

An alternative compile( ) method accepts flags that affect the behavior of regular expression matching:

Pattern Pattern.compile(String regex, int flag)


where flag is drawn from among the following Pattern class constants:

Compile Flag

Effect

Pattern.CANON_EQ

Two characters will be considered to match if, and only if, their full canonical decompositions match. The expression “a\u030A”, for example, will match the string “?” when this flag is specified. By default, matching does not take canonical equivalence into account.

Pattern.CASE_INSENSITIVE
(?i)

By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched. This flag allows your pattern to match without regard to case (upper or lower). Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

Pattern.COMMENTS
(?x)

In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line. Unix lines mode can also be enabled via the embedded flag expression.

Pattern.DOTALL
(?s)

In dotall mode, the expression ‘.’ matches any character, including a line terminator. By default, the ‘.’ expression does not match line terminators.

Pattern.MULTILINE
(?m)

In multiline mode, the expressions ‘^’ and ‘$’ match the beginning and ending of a line, respectively. ‘^’ also matches the beginning of the input string, and ‘$’ also matches the end of the input string. By default, these expressions only match at the beginning and the end of the entire input string.

Pattern.UNICODE_CASE
(?u)

When this flag is specified, case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched.

Pattern.UNIX_LINES
(?d)

In this mode, only the ‘\n’ line terminator is recognized in the behavior of ‘.’, ‘^’, and ‘$’.

Particularly useful among these flags are Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, and Pattern.COMMENTS (which is helpful for clarity and/or documentation). Note that the behavior of most of the flags can also be obtained by inserting the parenthesized characters, shown in the table beneath the flags, into your regular expression preceding the place where you want the mode to take effect.

You can combine the effect of these and other flags through an "OR" (‘|’) operation:

//: c12:ReFlags.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*;

public class ReFlags {
  private static Test monitor = new Test();
  public static void main(String[] args) {
    Pattern p =  Pattern.compile("^java",
      Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = p.matcher(
      "java has regex\nJava has regex\n" +
      "JAVA has pretty good regular expressions\n" +
      "Regular expressions are in Java");
    while(m.find())
      System.out.println(m.group());
    monitor.expect(new String[] {
      "java",
      "Java",
      "JAVA"
    });
  }
} ///:~


This creates a pattern that will match lines starting with “java,” “Java,” “JAVA,” etc., and attempt a match for each line within a multiline set (matches starting at the beginning of the character sequence and following each line terminator within the character sequence). Note that the group( ) method only produces the matched portion.
Thinking in Java
Prev Contents / Index Next


 
 
   Reproduced courtesy of Bruce Eckel, MindView, Inc. Design by Interspire