The MatchData object in Ruby gsub blocks
This post is absolutely only of interest to Ruby programmers. Just to save you some time.
I use regular expressions in Ruby a lot. One of the features I’ve come to use frequently is the block syntax for gsub calls. Whereas the other syntaxes for gsub only provide back references to capture groups in the replacement, the block syntax allows much more flexibility. You have access to the $ variables for capture groups, but you also have the full MatchData object for each match available within the block. Just in case anybody else doesn’t know, here’s the scoop…
Typically, gsub (the global version of sub) is used as a pattern/replacement method, with simple \1, \2 back references to make use of capture groups in the pattern (regular expression).
puts "A grin".gsub(/\b[A-Z] (\w+)/, 'Cheshire \1')
=> Cheshire grin
You can also pass a hash as the second argument. Each match is looked up as a key in the hash and replaced with the corresponding value; a match with no key in the hash is replaced with an empty string.
puts "A grin".gsub(/(\w+)/, 'grin' => 'cat', 'A' => 'Cheshire')
=> Cheshire cat
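To make the hash lookup a little more concrete, here’s a minimal sketch. The "A wide grin" string and the variable names are mine, made up for illustration. It shows what happens when a match has no key in the hash, and one way (a hash with a default block) to hand unmatched words back unchanged.

```ruby
# Each match is looked up as a key; the value becomes the replacement.
replacements = { 'grin' => 'cat', 'A' => 'Cheshire' }

with_all_keys = "A grin".gsub(/\w+/, replacements)
puts with_all_keys
# => Cheshire cat

# "wide" is not a key, so it is replaced with an empty string
# (note the two spaces left behind).
missing_key = "A wide grin".gsub(/\w+/, replacements)
puts missing_key.inspect
# => "Cheshire  cat"

# A default block on the hash returns the matched word itself
# whenever there is no key for it.
lenient = Hash.new { |_hash, key| key }
lenient.update(replacements)
lenient_result = "A wide grin".gsub(/\w+/, lenient)
puts lenient_result
# => Cheshire wide cat
```

Once you need more logic than a lookup, the block form is usually easier to read than a clever hash.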
These are essential tools for quick string manipulations. As you move on to parsing larger quantities of text, you usually want to do something further with the matches, whether it’s additional logic or just more complex manipulations than simple \1 syntax provides. That’s where the block format is perfect.
A gsub call with a block looks like this:
string = "How puzzling all these changes are!"
string.gsub!(/\b(\w+)/) do |match|
  if match =~ /^(\p{Lu}|t)/
    # words starting with an uppercase letter or "t" are reversed
    match.reverse
  else
    # everything else gets its letters sorted
    match.split('').sort.join('')
  end
end
puts string
=> woH gilnpuzz all eseht aceghns aer!
Within that block, I always expected “match” to carry the full set of MatchData methods with it, but it’s just a String containing the overall match. You do have access to the $ variables, which you can use for referencing capture groups ($1, $2, …) in the match. However, you also have access to Regexp.last_match, which provides a MatchData object for the current gsub iteration, with all of its methods such as :names, :length and :offset, the matched string (:to_s), the original string (:string), etc.
You can even get the “pre” and “post” parts of the original string for checking context within a broader search expression. I won’t go into a detailed example, but here’s sample usage:
"I'm late, I'm late".gsub(/(\w+)/) do |match|
  m = Regexp.last_match
  string = m.to_s
  before_string = m.pre_match
  after_string = m.post_match
  # ...
  string
end
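For a slightly fuller (and entirely made-up) illustration of checking context, this sketch capitalizes the first word of each sentence and shouts the last one, using only pre_match and post_match to decide. The rules for what counts as a sentence start or end are my own simplification, not anything built into Ruby.

```ruby
text = "we're all mad here. i'm mad. you're mad."

result = text.gsub(/\w+/) do |word|
  m = Regexp.last_match
  # Start of a sentence: nothing before the match, or a period and a space.
  sentence_start = m.pre_match.empty? || m.pre_match.end_with?(". ")
  # End of a sentence: the match is followed directly by a period.
  sentence_end = m.post_match.start_with?(".")

  if sentence_start
    word.capitalize
  elsif sentence_end
    word.upcase
  else
    word
  end
end

puts result
# => We're all mad HERE. I'm MAD. You're MAD.
```

Lookarounds could do some of this inside the pattern itself, but plain string checks on pre_match and post_match are often easier to read and extend.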
You can actually leave off the block param (|match|) entirely. The “match” variable is just the equivalent of Regexp.last_match.to_s, so you can recreate it inside the block if you need it.
my_string.gsub!(/[[:punct:]]/) do
  match = Regexp.last_match.to_s
  # ...
end
You could also use Regexp.last_match[0]. The MatchData object provides direct access to capture group strings when addressed as an array (:[]), 0 being the full matched string.
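A quick sketch of the array-style access, with a made-up string and pattern. Index 0 is the whole match, and 1 and 2 are the capture groups.

```ruby
seen = []
result = "Tweedledum and Tweedledee".gsub(/(Tweedle)(d\w+)/) do
  m = Regexp.last_match
  # m[0] is the whole match; m[1] and m[2] are the capture groups.
  seen << [m[0], m[1], m[2]]
  m[2]
end

puts result
# => dum and dee
p seen.first
# => ["Tweedledum", "Tweedle", "dum"]
```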
Store the Regexp.last_match object for each iteration in a variable at the top of the block. If you run any other regular expression match within the block (=~, match, scan, another gsub, and so on), last_match will be overwritten.
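Here’s a minimal sketch of that gotcha. The throwaway "x" =~ /x/ match stands in for whatever other regex work you might do inside the block.

```ruby
text = "mad hatter"

# Unsafe: the =~ inside the block replaces Regexp.last_match.
clobbered = text.gsub(/\w+/) do
  "x" =~ /x/                # any match here resets last_match
  Regexp.last_match.to_s    # now "x", not the word gsub matched
end

# Safe: grab the MatchData first, then run other regexes freely.
preserved = text.gsub(/\w+/) do
  m = Regexp.last_match
  "x" =~ /x/
  m.to_s.capitalize
end

puts clobbered
# => x x
puts preserved
# => Mad Hatter
```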
For short runs, you can put the block format in a single line with brace syntax and ternary operators. Here’s a contrived example to illustrate a simple one-liner:
class String
  def hatter
    gsub(/[[:alpha:]]/) { |m| Regexp.last_match.offset(0)[0] % 3 == 1 ? m.upcase : m.downcase }
    # that was the one-liner!
  end
end
string = "But I don't want to go among mad people.\nOh, you can't help that. We're all mad here. I'm mad. You're mad.\nHow do you know I'm mad?\nYou must be. Or you wouldn't have come here."
puts string.hatter
=> bUt I dOn'T wAnt to go amOng maD pEopLe.
oh, yOu Can't HelP tHat. wE'rE aLl Mad heRe. i'M mAd. yoU'rE mAd.
hoW dO yOu KnoW i'm Mad?
yOu MusT bE. Or You woUldN't haVe ComE hEre.
Loads of fun. Of course this is only useful for string manipulation/processing up to a certain limit, at which point you’ll probably want to start studying StringScanner.
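If you’re curious what that looks like, here’s a very rough sketch of StringScanner (from Ruby’s strscan standard library) walking a string one token at a time. The token categories (:word, :punct) are my own invention for the example, not anything StringScanner defines.

```ruby
require 'strscan'

scanner = StringScanner.new("I'm late, I'm late")
tokens = []

until scanner.eos?
  if scanner.scan(/[\w']+/)
    tokens << [:word, scanner.matched]
  elsif scanner.scan(/[[:punct:]]/)
    tokens << [:punct, scanner.matched]
  else
    scanner.getch   # skip whitespace and anything else
  end
end

p tokens
# => [[:word, "I'm"], [:word, "late"], [:punct, ","], [:word, "I'm"], [:word, "late"]]
```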
“I haven’t the slightest idea,” said the Hatter.