Alan W. Smith

Reputation: 25425

Ruby one-liner to capture regular expression matches

In Perl, I use the following one-line statements to pull matches out of a string with regular expressions and assign them. This one finds a single match and assigns it to a string:

my $string = "the quick brown fox jumps over the lazy dog.";

my $extractString = ($string =~ m{fox (.*?) dog})[0];

Result: $extractString == 'jumps over the lazy'

And this one creates an array from multiple matches:

my $string = "the quick brown fox jumps over the lazy dog.";

my @extractArray = $string =~ m{the (.*?) fox .*?the (.*?) dog};

Result: @extractArray == ['quick brown', 'lazy']

Is there an equivalent way to create these one-liners in Ruby?

Upvotes: 7

Views: 3541

Answers (3)

falsetru

Reputation: 368924

Use String#match and MatchData#[] or MatchData#captures to get matched backreferences.

s = "the quick brown fox jumps over the lazy dog."

s.match(/fox (.*?) dog/)[1]
# => "jumps over the lazy"
s.match(/fox (.*?) dog/).captures
# => ["jumps over the lazy"]

s.match(/the (.*?) fox .*?the (.*?) dog/)[1..2]
# => ["quick brown", "lazy"]
s.match(/the (.*?) fox .*?the (.*?) dog/).captures
# => ["quick brown", "lazy"]

UPDATE

To avoid an "undefined method `[]' for nil" error (NoMethodError) when nothing matches:

(s.match(/fox (.*?) cat/) || [])[1]
# => nil
(s.match(/the (.*?) fox .*?the (.*?) cat/) || [])[1..2]
# => nil
(s.match(/the (.*?) fox .*?the (.*?) cat/) || [])[1..-1] # instead of .captures
# => nil
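On Ruby 2.3 or later, the safe navigation operator (&.) is another way to handle the no-match case. This is only a minimal sketch, assuming that Ruby version is available; the names hit and miss are illustrative:

```ruby
s = "the quick brown fox jumps over the lazy dog."

# &. skips the method call and yields nil when match returns nil
hit  = s.match(/fox (.*?) dog/)&.captures
miss = s.match(/fox (.*?) cat/)&.captures

p hit   # => ["jumps over the lazy"]
p miss  # => nil
```

This keeps .captures usable without the || [] fallback, at the cost of requiring a newer Ruby.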

Upvotes: 8

sawa

Reputation: 168071

string = "the quick brown fox jumps over the lazy dog."

extract_string = string[/fox (.*?) dog/, 1]
# => "jumps over the lazy"

extract_array = string.scan(/the (.*?) fox .*?the (.*?) dog/).first
# => ["quick brown", "lazy"]

This approach will also return nil (instead of throwing an error) if no match is found.

extract_string = string[/MISSING_CAT (.*?) dog/, 1]
# => nil

extract_array = string.scan(/the (.*?) MISSING_CAT .*?the (.*?) dog/).first
# => nil
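The .first is needed because scan returns one array of captures per match, not just the first match. A short sketch of that behaviour (the variable names are illustrative, not part of the original answer):

```ruby
string = "the quick brown fox jumps over the lazy dog."

# scan collects the captures of every match, one inner array per match
after_the = string.scan(/the (\w+)/)
p after_the        # => [["quick"], ["lazy"]]
p after_the.first  # => ["quick"]

# The inner array can be destructured; to_a turns a no-match nil into []
first, second = string.scan(/the (.*?) fox .*?the (.*?) dog/).first.to_a
p first   # => "quick brown"
p second  # => "lazy"
```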

Upvotes: 12

the Tin Man

Reputation: 160551

First, be careful thinking in Perl terms when writing in Ruby. We do things a bit more verbosely to make the code more readable.

I'd write my @extractArray = $string =~ m{the (.*?) fox .*?the (.*?) dog}; as:

string = "the quick brown fox jumps over the lazy dog."

string[/the (.*?) fox .*?the (.*?) dog/]
extract_array = $1, $2
# => ["quick brown", "lazy"]

Ruby, like Perl, is aware of the capture groups and assigns them to the variables $1, $2, etc. Those make it very clean and clear when grabbing values and assigning them later. The regex engine also lets you create and assign named captures, but they tend to obscure what's happening, so, for clarity, I tend to go this way.
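One caveat worth illustrating: $1, $2 and friends reflect the most recent match attempt, so a later failed match resets them to nil. A small sketch (after_match and after_miss are illustrative names):

```ruby
string = "the quick brown fox jumps over the lazy dog."

string[/the (.*?) fox .*?the (.*?) dog/]
after_match = [$1, $2]
p after_match  # => ["quick brown", "lazy"]

# A failed match clears the captures, so copy them before matching again
string[/the (.*?) cat/]
after_miss = [$1, $2]
p after_miss   # => [nil, nil]
```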

We can use match to get there too:

/the (.*?) fox .*?the (.*?) dog/.match(string) # => #<MatchData "the quick brown fox jumps over the lazy dog" 1:"quick brown" 2:"lazy">

but is the end result more readable?

extract_array = /the (.*?) fox .*?the (.*?) dog/.match(string)[1..-1] 
# => ["quick brown", "lazy"]

The named captures are interesting too:

/the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/ =~ string
quick_brown # => "quick brown"
lazy # => "lazy"

But they result in wondering where those variables were initialized and assigned; I sure don't look in regular expressions for those to occur, so it's potentially confusing to others, and becomes a maintenance issue again.


Cary says:

To elaborate a little on named captures, if match_data = string.match /the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/, then match_data[:quick_brown] # => "quick brown" and match_data[:lazy] # => "lazy" (as well as quick_brown # => "quick brown" and lazy # => "lazy"). With named captures available, I see no reason for using global variables or Regexp.last_match, etc.

Yes, but there's some smell there too.

We can use values_at with the MatchData result of match to retrieve the values captured, but there are some unintuitive behaviors in the class that turn me off:

/the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/.match(string)['lazy']

works, and implies that MatchData knows how to behave like a Hash:

{'lazy' => 'dog'}['lazy'] # => "dog"

and it has a values_at method, like Hash, but it doesn't work intuitively:

/the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/.match(string).values_at('lazy') # => 
# ~> -:6:in `values_at': no implicit conversion of String into Integer (TypeError)

Whereas:

/the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/.match(string).values_at(2) # => ["lazy"]

which now acts like an Array:

['all captures', 'quick brown', 'lazy'].values_at(2) # => ["lazy"]

I want consistency and this makes my head hurt.
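For readers on Ruby 2.4 or later, MatchData#named_captures offers one way around the inconsistency described above, since it returns an ordinary Hash keyed by group name. A sketch, assuming that Ruby version; match_data and captures are illustrative names:

```ruby
string = "the quick brown fox jumps over the lazy dog."

match_data = /the (?<quick_brown>.*?) fox .*?the (?<lazy>.*?) dog/.match(string)

# named_captures returns a plain Hash whose keys are the group names as strings
captures = match_data.named_captures
p captures["quick_brown"]     # => "quick brown"
p captures.values_at("lazy")  # => ["lazy"]
```

Because captures is a real Hash, values_at and the other Hash methods behave as they would on any other Hash.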

Upvotes: 3
