Ruby Programming - Backslash Sequences in the Substitution

Backslash Sequences in the Substitution

Earlier we noted that the sequences \1, \2, and so on are available in the pattern, standing for the nth group matched so far. The same sequences are available in the second argument of sub and gsub.

`"fred:smith".sub(/(\w+):(\w+)/, '\2, \1')`	»	`"smith, fred"`
`"nercpyitno".gsub(/(.)(.)/, '\2\1')`	»	`"encryption"`
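
The two substitutions above can be wrapped in small methods to try them out. This is only a sketch: the method names `surname_first` and `swap_pairs` are hypothetical, introduced for illustration. The last line shows why the replacement is usually written in single quotes: inside a double-quoted literal, Ruby itself reads "\1" as an octal escape before sub ever sees it, so the backslash has to be doubled.

```ruby
# Hypothetical helpers, named here only for illustration.

# Reorder "first:last" into "last, first" using group references.
def surname_first(name)
  name.sub(/(\w+):(\w+)/, '\2, \1')
end

# Swap each adjacent pair of characters.
def swap_pairs(text)
  text.gsub(/(.)(.)/, '\2\1')
end

puts surname_first("fred:smith")   # prints smith, fred
puts swap_pairs("nercpyitno")      # prints encryption

# In a double-quoted literal, "\1" is the single character with code 1,
# so the backslash must be doubled for sub to see a group reference.
puts "fred:smith".sub(/(\w+):(\w+)/, "\\2, \\1")   # prints smith, fred
```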

There are additional backslash sequences that work in substitution strings: \& (last match), \+ (last matched group), \` (string prior to match), \' (string after match), and \\ (a literal backslash). It gets confusing if you want to include a literal backslash in a substitution. The obvious thing is to write

str.gsub(/\\/, '\\\\')

Clearly, this code is trying to replace each backslash in str with two. The programmer doubled up the backslashes in the replacement text, knowing that they'd be converted to "\\" in syntax analysis. However, when the substitution occurs, the regular expression engine performs another pass through the string, converting "\\" to "\", so the net effect is to replace each single backslash with another single backslash. You need to write gsub(/\\/, '\\\\\\\\')!

`str = 'a\b\c'`	»	`"a\b\c"`
`str.gsub(/\\/, '\\\\\\\\')`	»	`"a\\b\\c"`
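
Counting characters makes the two passes easier to follow. The sketch below assumes the same five-character sample string; the constant names are hypothetical and exist only for this illustration.

```ruby
# Assumed sample input: the five characters a \ b \ c.
SAMPLE = 'a\b\c'

# Pass 1 (syntax analysis): a single-quoted literal turns each \\
# into one backslash.
TWO_BACKSLASHES  = '\\\\'        # 2 characters
FOUR_BACKSLASHES = '\\\\\\\\'    # 4 characters

puts TWO_BACKSLASHES.length    # prints 2
puts FOUR_BACKSLASHES.length   # prints 4

# Pass 2 (substitution): gsub reads \\ in the replacement as one
# literal backslash, so two characters insert one backslash and
# four characters insert two.
UNCHANGED = SAMPLE.gsub(/\\/, TWO_BACKSLASHES)
DOUBLED   = SAMPLE.gsub(/\\/, FOUR_BACKSLASHES)

puts UNCHANGED.length   # prints 5, same as the input
puts DOUBLED.length     # prints 7, each backslash is now two
puts DOUBLED            # prints a\\b\\c
```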

However, using the fact that \& is replaced by the matched string, you could also write

`str = 'a\b\c'`	»	`"a\b\c"`
`str.gsub(/\\/, '\&\&')`	»	`"a\\b\\c"`
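
The other match-related sequences behave in the same way as \&. The following is a minimal sketch on a hypothetical sample string (the constant names are illustrative only). Note that \' needs care in the source: inside a single-quoted literal, \' is just an escaped quote, so the example writes it in a double-quoted literal with a doubled backslash.

```ruby
# Hypothetical sample text, chosen only for illustration.
TEXT = "hello world"

# \& is the whole match: wrap the first word in brackets.
BRACKETED = TEXT.sub(/\w+/, '[\&]')

# \` is the text before the match, \' is the text after it.
# "\\'" is used because \' inside single quotes would be a bare quote.
BEFORE = TEXT.sub(/ /, '<\`>')
AFTER  = TEXT.sub(/ /, "<\\'>")

puts BRACKETED   # prints [hello] world
puts BEFORE      # prints hello<hello>world
puts AFTER       # prints hello<world>world
```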

If you use the block form of gsub, the string for substitution is analyzed only once (during the syntax pass) and the result is what you intended.

`str = 'a\b\c'`	»	`"a\b\c"`
`str.gsub(/\\/) { '\\\\' }`	»	`"a\\b\\c"`
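
A consequence of the block form is that sequences such as \1 are not expanded in the block's result; the matched text is available through the block parameter (or $1, $2, and so on) instead. A short sketch, using hypothetical helper names:

```ruby
# Hypothetical helper names, introduced only for illustration.

# Block form: the block's return value is inserted verbatim,
# so two backslashes in, two backslashes out.
def double_backslashes(str)
  str.gsub(/\\/) { '\\\\' }
end

# The block parameter holds the matched text.
def capitalize_words(str)
  str.gsub(/\w+/) { |word| word.capitalize }
end

puts double_backslashes('a\b\c')        # prints a\\b\\c
puts capitalize_words("hello world")    # prints Hello World

# \1 in a block result is not treated as a group reference.
puts "ab".gsub(/(a)/) { '\1' }          # prints \1b
```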

Finally, as an example of the wonderful expressiveness of combining regular expressions with code blocks, consider the following code fragment from the CGI library module, written by Wakou Aoyama. The code takes a string containing HTML escape sequences and converts it into normal ASCII. Because it was written for a Japanese audience, it uses the "n" modifier on the regular expressions, which turns off wide-character processing. It also illustrates Ruby's case expression, which is discussed elsewhere in this book.

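def unescapeHTML(string)
  str = string.dup
  # Replace each &...; sequence with the value returned by the block.
  str.gsub!(/&(.*?);/n) {
    # Save $1 now: the regexp tests in the case below overwrite it.
    match = $1.dup
    case match
    when /\Aamp\z/ni           then '&'
    when /\Aquot\z/ni          then '"'
    when /\Agt\z/ni            then '>'
    when /\Alt\z/ni            then '<'
    when /\A#(\d+)\z/n         then Integer($1).chr   # decimal, e.g. &#65;
    when /\A#x([0-9a-f]+)\z/ni then $1.hex.chr        # hex, e.g. &#x41;
    end
  }
  str
end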

puts unescapeHTML("1&lt;2 &amp;&amp; 4&gt;3")
puts unescapeHTML("&quot;A&quot; = &#65; = &#x41;")

produces:

1<2 && 4>3
"A" = A = A
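
For the four named entities alone, gsub also accepts a hash as its second argument and looks up each matched string in it. The sketch below is a hypothetical, table-driven variant and not the CGI library's implementation: the names `ENTITIES` and `unescape_named` are assumptions, it handles only lowercase named entities, and it ignores numeric references. Because the pattern matches only the listed entities, any other sequence is left in place.

```ruby
# Assumed lookup table: only the four named entities from the code above.
ENTITIES = {
  '&amp;'  => '&',
  '&quot;' => '"',
  '&gt;'   => '>',
  '&lt;'   => '<'
}.freeze

# Hypothetical variant, not the CGI library's implementation.
# Each match is used as a key into ENTITIES.
def unescape_named(string)
  string.gsub(/&(?:amp|quot|gt|lt);/, ENTITIES)
end

puts unescape_named("1&lt;2 &amp;&amp; 4&gt;3")   # prints 1<2 && 4>3
puts unescape_named("a&nbsp;b")                   # prints a&nbsp;b
```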
