GNU C Library (libc) Programming Guide - Subexpression Complications

Next: Regexp Cleanup, Previous: Regexp Subexpressions, Up: Regular Expressions

10.3.5 Complications in Subexpression Matching

Sometimes a subexpression matches a substring of no characters. This happens when `f\(o*\)' matches the string `fum'. (It really matches just the `f'.) In this case, both of the offsets identify the point in the string where the null substring was found. In this example, the offsets are both 1.

Sometimes the entire regular expression can match without using some of its subexpressions at all—for example, when `ba\(na\)*' matches the string `ba', the parenthetical subexpression is not used. When this happens, regexec stores -1 in both fields of the element for that subexpression.

Sometimes matching the entire regular expression can match a particular subexpression more than once—for example, when `ba\(na\)*' matches the string `bananana', the parenthetical subexpression matches three times. When this happens, regexec usually stores the offsets of the last part of the string that matched the subexpression. In the case of `bananana', these offsets are 6 and 8.

But the last match is not always the one that is chosen. It's more accurate to say that the last opportunity to match is the one that takes precedence. What this means is that when one subexpression appears within another, then the results reported for the inner subexpression reflect whatever happened on the last match of the outer subexpression. For an example, consider `\(ba\(na\)*s \)*' matching the string `bananas bas '. The last time the inner expression actually matches is near the end of the first word. But it is considered again in the second word, and fails to match there. regexec reports nonuse of the “na” subexpression.

Another place where this rule applies is when the regular expression

     \(ba\(na\)*s \|nefer\(ti\)* \)*

matches `bananas nefertiti'. The “na” subexpression does match in the first word, but it doesn't match in the second word because the other alternative is used there. Once again, the second repetition of the outer subexpression overrides the first, and within that second repetition, the “na” subexpression is not used. So regexec reports nonuse of the “na” subexpression.