The sed FAQ

5.4. My RE isn't matching/deleting what I want it to. (Or, "Greedy vs. stingy pattern matching")

The two most common causes for this problem are: (1) misusing the '.' metacharacter, and (2) misusing the '*' metacharacter. The RE '.*' is designed to be "greedy" (i.e., matching as many characters as possible). However, sometimes users need an expression which is "stingy," matching the shortest possible string.

(1) On single-line patterns, the '.' metacharacter matches any single character on the line. ('.' cannot match the newline at the end of the line because the newline is removed when the line is put into the pattern space; sed adds a newline automatically when the pattern space is printed.) On multi-line patterns obtained with the 'N' or 'G' commands, '.' will match a newline in the middle of the pattern space. If there are 3 lines in the pattern space, "s/.*//" will delete all 3 lines, not just the first one (leaving 1 blank line, since the trailing newline is added to the output).
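The difference between the single-line and multi-line cases can be checked by counting output lines. This is only a sketch: it assumes a sed in which '.' matches an embedded newline in the pattern space (GNU sed behaves this way), and the variable names are made up for illustration.

```shell
# Line by line: each of the 3 lines is emptied separately,
# so 3 blank lines come out.
per_line=$(printf 'one\ntwo\nthree\n' | sed 's/.*//' | wc -l | tr -d ' ')

# N;N pulls all 3 lines into the pattern space first; '.*' then
# swallows the embedded newlines too, so only 1 blank line comes out.
joined=$(printf 'one\ntwo\nthree\n' | sed 'N;N;s/.*//' | wc -l | tr -d ' ')

echo "per-line: $per_line, joined: $joined"
```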

The usual misuse of '.' occurs when trying to match a word or bounded field and forgetting that '.' will also cross the field boundaries. Suppose you want to delete the first word in braces:

       echo '{one} {two} {three}' | sed 's/{.*}/{}/'       # fails: prints {}
       echo '{one} {two} {three}' | sed 's/{[^}]*}/{}/'    # succeeds: prints {} {two} {three}

's/{.*}/{}/' is not the solution, since '.' matches any character, including the close braces, so '.*' runs all the way to the last '}' on the line. Replace the '.' with '[^}]', a negated character set '[^...]' that matches any single character other than a right brace. Of course, 's/{one}/{}/' would also solve this particular case, but the point is to illustrate the negated character set, which matches anything but the characters listed inside it.

A negated character set should be used for matching words between quote marks, for fields separated by commas, and so on. See also section 4.12 ("How do I parse a comma-delimited data file?").
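As a small illustration of the same idea for quote marks and commas, the following sketch compares a greedy '.*' against a negated set. The input strings and variable names are invented for this example.

```shell
# Greedy: '.*' runs from the first quote to the LAST quote on the line.
greedy=$(echo 'say "one" and "two"' | sed 's/".*"/""/')

# Negated set: '[^"]*' cannot cross a quote, so only the first
# quoted string is emptied.
bounded=$(echo 'say "one" and "two"' | sed 's/"[^"]*"/""/')

# Comma-separated fields: remove the first field and its comma.
rest=$(echo 'alpha,beta,gamma' | sed 's/^[^,]*,//')

echo "$greedy"     # say ""
echo "$bounded"    # say "" and "two"
echo "$rest"       # beta,gamma
```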

(2) The '*' metacharacter represents zero or more instances of the previous expression. sed looks for the leftmost possible match first, and because '*' is satisfied by zero instances, that leftmost match may be an empty string. Thus,

       echo foo | sed 's/o*/EEE/'

will generate 'EEEfoo', not 'fEEE' as one might expect. This is because /o*/ matches the null string at the beginning of the word.
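One common way around this, sketched here, is to require at least one occurrence by writing the expression twice ('oo*' means "one 'o', then zero or more"). The variable names are only for illustration.

```shell
# 'o*' matches the empty string before the 'f', so the replacement
# is inserted at the start of the line.
empty_match=$(echo foo | sed 's/o*/EEE/')

# 'oo*' needs at least one 'o', so the leftmost match is the "oo".
real_match=$(echo foo | sed 's/oo*/EEE/')

echo "$empty_match"   # EEEfoo
echo "$real_match"    # fEEE
```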

After finding the leftmost possible match, the '*' is GREEDY; it always tries to match the longest possible string. When two or three instances of '.*' occur in the same RE, the leftmost instance will grab the most characters. Consider this example, which uses grouping '\(...\)' to save patterns:

       echo bar bat bay bet bit | sed 's/^.*\(b.*\)/\1/'

What will be displayed is 'bit', never anything longer, because the leftmost '.*' took the longest possible match. Remember this rule: "leftmost match, longest possible string, zero also matches."
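If the goal were to keep more than the last word, one option is to replace the leading '.*' with something that cannot run so far. The sketch below swaps it for '[^ ]* ' (one word and the space after it); the choice of what to skip is just an assumption for the sake of the example.

```shell
# Leading '.*' is greedy: it leaves the group only the last "b" word.
greedy=$(echo bar bat bay bet bit | sed 's/^.*\(b.*\)/\1/')

# Leading '[^ ]* ' can consume only the first word and one space,
# so the group captures everything from the second word onward.
bounded=$(echo bar bat bay bet bit | sed 's/^[^ ]* \(b.*\)/\1/')

echo "$greedy"    # bit
echo "$bounded"   # bat bay bet bit
```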

Reprinted courtesy of Eric Pement. Also available at https://sed.sourceforge.net/sedfaq.html