Ruby String#scan with MatchData
This post will only be of interest to people writing scripts in Ruby. Seriously, zero utility if you’re not using Ruby. Though I would be curious how you accomplish the same thing in other languages like Rust and Python, because I’ve never gotten too deep with string manipulation in anything other than Ruby, Swift, and Objective-C. If you care to leave a comment with pointers, I’m all ears.
I do a lot of string manipulation in Ruby. One of the things that always gets me is that the Regexp#match method returns a MatchData object with groups, but only for the first instance. To match all instances for enumeration, you have to use String#scan. But scan returns plain strings (or arrays of captured strings when the pattern has groups), not MatchData, so you lose named groups and match offsets. So a while back I figured out the solution, and I thought I’d share it for any aspiring Ruby scripters.
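To make the difference concrete, here is a minimal sketch. The sample string and the `key`/`value` group names are made up for illustration and are not from any real script.

```ruby
# Hypothetical sample data, just for illustration
str = "a=1 b=2 c=3"
regex = /(?<key>\w)=(?<value>\d)/

# match stops at the first hit but hands back a full MatchData
first = str.match(regex)
p first.class      # MatchData
p first[:key]      # "a"
p first[:value]    # "1"

# scan finds every hit, but each one is just an array of captured strings
p str.scan(regex)  # [["a", "1"], ["b", "2"], ["c", "3"]]
```

The scan results have no group names and no offsets, which is the gap the rest of this post fills.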
The trick is to enumerate the scan results and replace each one with Regexp.last_match, which returns the MatchData (including groups and named groups) from the most recent match. Thus:
str.to_enum(:scan, regex).map { Regexp.last_match }
results in an array of MatchData. Then you can iterate through it and use indexes or group names to pull out particular groups of each match.
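Iterating over the result might look something like this. Again, the sample string and group names are hypothetical, chosen only to keep the sketch short.

```ruby
# Hypothetical sample data, just for illustration
str = "a=1 b=2 c=3"
regex = /(?<key>\w)=(?<value>\d)/

# One MatchData per match instead of bare strings
matches = str.to_enum(:scan, regex).map { Regexp.last_match }

matches.each do |m|
  # m[0] is the whole match, m[:key] a named group, m.begin(0) the start offset
  puts "#{m[:key]} => #{m[:value]} (#{m[0].inspect} at offset #{m.begin(0)})"
end
# a => 1 ("a=1" at offset 0)
# b => 2 ("b=2" at offset 4)
# c => 3 ("c=3" at offset 8)
```

Numbered access works too, so `m[1]` and `m[2]` return the same values as `m[:key]` and `m[:value]` here.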
I’ve combined this with a few other methods to create a general string handling routine that I use regularly.
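# frozen_string_literal: true

# String helpers
class ::String
  # Scan for every match of regex and return an array of MatchData
  def match_scan(regex)
    to_enum(:scan, regex).map { Regexp.last_match }
  end

  # Return each match as a hash of named groups with symbol keys
  def matches(regex)
    match_scan(regex).match_to_h.map(&:symbolize_keys)
  end
end

# Array helpers
class ::Array
  # Convert an array of MatchData to hashes of named captures,
  # stripping whitespace from each captured value (nil stays nil)
  def match_to_h
    map { |m| m.named_captures.each_with_object({}) { |(k, v), h| h[k] = v&.strip } }
  end
end

# Hash helpers
class ::Hash
  # Recursively convert keys to symbols
  def symbolize_keys
    each_with_object({}) { |(k, v), hsh| hsh[k.to_sym] = v.is_a?(Hash) ? v.symbolize_keys : v }
  end
end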
With the above methods available, you can do something like:
str = <<~EOEMAILS
  Arthur P. Dent <arthur@example.com>
  Ford Prefect <perfect@example.com>
  Zaphod Beeblebrox <zaph@example.com>
  Mrs. Alice Beeblebrox <zaphsfav@example.com>
  Slartibartfast <fjordmaster@example.com>
  Marvin the Paranoid Android <planetbrain@example.com>
EOEMAILS
rx = /(?<prefix>\S+\. )?(?<first>.*?)(?:( (?<middle>\w+\.?))*(?: (?<last>[\w-]+)))? <(?<email>.*?)>/i
pp str.matches(rx)
Running that results in:
[{:prefix=>nil,
  :first=>"Arthur",
  :middle=>"P.",
  :last=>"Dent",
  :email=>"arthur@example.com"},
 {:prefix=>nil,
  :first=>"Ford",
  :middle=>nil,
  :last=>"Prefect",
  :email=>"perfect@example.com"},
 {:prefix=>nil,
  :first=>"Zaphod",
  :middle=>nil,
  :last=>"Beeblebrox",
  :email=>"zaph@example.com"},
 {:prefix=>"Mrs.",
  :first=>"Alice",
  :middle=>nil,
  :last=>"Beeblebrox",
  :email=>"zaphsfav@example.com"},
 {:prefix=>nil,
  :first=>"Slartibartfast",
  :middle=>nil,
  :last=>nil,
  :email=>"fjordmaster@example.com"},
 {:prefix=>nil,
  :first=>"Marvin",
  :middle=>"Paranoid",
  :last=>"Android",
  :email=>"planetbrain@example.com"}]
That’s a silly example, but hopefully you can see the utility of turning a regular expression into an array of hashes containing the individual values of each match extracted by scanning the string.
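If you’d rather not reopen String, Array, and Hash, the same idea fits in a single standalone method. This is only a sketch: `named_matches` is a hypothetical name, and the sample data and regex are made up. It leans on `Hash#transform_keys` and `Hash#transform_values`, so it assumes Ruby 2.5 or later.

```ruby
# Hypothetical standalone helper: returns one hash of named groups per match,
# with symbol keys and whitespace-stripped values
def named_matches(str, regex)
  str.to_enum(:scan, regex).map do
    Regexp.last_match.named_captures
          .transform_keys(&:to_sym)
          .transform_values { |v| v&.strip }
  end
end

# Made-up sample data
list = "Ford Prefect <ford@example.com>\nSlartibartfast <fjordmaster@example.com>"
rows = named_matches(list, /(?<name>[^<\n]+) <(?<email>[^>]+)>/)

# Each row is an ordinary hash, so the usual Enumerable tools apply
rows.each { |r| puts "#{r[:name]}: #{r[:email]}" }
# Ford Prefect: ford@example.com
# Slartibartfast: fjordmaster@example.com
```

Named captures are always a flat hash, so a one-level key conversion is enough here and the recursive symbolize_keys isn’t needed.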