Denis Yakovenko

Reputation: 3535

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from a multiline string, e.g. -45.124, 1124.325, etc.

Here's what I do:

text.scan(/(\+|\-)?\d+(\.\d+)?/)

Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.

Any ideas why it's happening and how I can improve that?

Upvotes: 5

Views: 1169

Answers (3)

Wiktor Stribiżew

Reputation: 627087

See scan documentation:

If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.

You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
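
To make the quoted behavior concrete, here is a minimal sketch that runs the pattern from the question next to a group-free version of it. The variable names with_groups and without_groups are only illustrative:

```ruby
text = " -45.124, 1124.325"

# Pattern from the question: two capturing groups, so scan returns
# one array per match holding only the captured parts.
with_groups = text.scan(/(\+|\-)?\d+(\.\d+)?/)
p with_groups     # => [["-", ".124"], [nil, ".325"]]

# Same pattern without capturing groups: scan returns the whole matches.
without_groups = text.scan(/[+-]?\d+(?:\.\d+)?/)
p without_groups  # => ["-45.124", "1124.325"]
```

This is why the pattern looks fine on regex101 (which shows group 0, the whole match) but not in the scan result, which only exposes the numbered groups.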

  1. In this scenario, the capturing group is only used to quantify a pattern sequence, so all you need to do is convert it into a non-capturing group by replacing each unescaped ( with (?:. The sign alternation (\+|\-) can be written as the character class [+-], which leaves only one group to convert:
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)

See demo, output:

-45.124
1124.325

If you also need to match floats without a leading digit, like .04, you can use [+-]?\d*\.?\d+ (note that it matches plain integers too). See another demo
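
Since the question asks for an array of floats rather than strings, a possible follow-up step is to convert each match with to_f. A small sketch, assuming a made-up multiline sample string:

```ruby
# Hypothetical multiline input, chosen only for illustration.
text = "-45.124, 1124.325\n.04 +7"

# No capturing groups, so scan yields whole matches as strings;
# to_f then turns each string into a Float.
floats = text.scan(/[+-]?\d*\.?\d+/).map(&:to_f)
p floats  # => [-45.124, 1124.325, 0.04, 7.0]
```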

  2. There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to it. In that case, you may a) declare a variable and collect the whole match inside a scan block, b) enclose the whole pattern in another capturing group and map the results to the first item of each match, or c) call gsub with the regex as its only argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results                              # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first)  # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a  # => ["11", "666666"]

See this Ruby demo.

Upvotes: 8

lacostenycoder

Reputation: 11226

If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.

Suppose you want to get the image URLs in this string, perhaps from Markdown text with HTML image tags:

str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">

After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip

You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp

image_regex =
  /https\:\/\/(user-)?images\.(githubusercontent|zenhubusercontent)\.com.*\b/

Now, if you don't need each sub-capture group but just the entire expression from .scan, you can wrap the whole pattern inside a capture group and use it like this:

image_regex =
  /(https\:\/\/(user-)?images\.(githubusercontent|zenhubusercontent)\.com.*\b)/

str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
 "https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]

How does this actually work?

Since you have 3 capture groups, .scan alone will return an array of arrays, each with one entry per capture group:

str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
 ["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]

Since we only want the 1st (outer) capture group, we can just call .map(&:first)
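
If the inner groups are only there for grouping and are not referenced anywhere, an alternative is to make them non-capturing, so that .scan returns the full matches directly and no .map is needed. A minimal sketch, assuming a shortened sample string with made-up URLs and a variant of the pattern written with %r{} to avoid escaping slashes:

```ruby
# Hypothetical sample input; these URLs are invented for illustration.
sample = <<~HTML
  <img src="https://images.zenhubusercontent.com/abc/def">
  <img src="https://user-images.githubusercontent.com/123/pic.png">
HTML

# Every group is non-capturing (?:...), so scan yields whole matches.
image_regex = %r{https://(?:user-)?images\.(?:githubusercontent|zenhubusercontent)\.com.*\b}

urls = sample.scan(image_regex)
p urls
# => ["https://images.zenhubusercontent.com/abc/def",
#     "https://user-images.githubusercontent.com/123/pic.png"]
```

Wrapping the pattern in an outer group remains the better fit when the inner captures are needed elsewhere, for example in a backreference.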

Upvotes: 1

garyh

Reputation: 2852

([+-]?\d+\.\d+)

This assumes there is at least one digit before the decimal point, and that a decimal point is present.

see demo at Rubular
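
Because this pattern wraps everything in one capturing group, scan returns one-element arrays. A short sketch of how it behaves, assuming a made-up sample string and flattening the result:

```ruby
# Hypothetical input with an integer, a float, and a float without a leading digit.
text = "12 -45.124 .04"

# One capturing group around the whole pattern: each result is a
# one-element array, so flatten gives a plain array of strings.
matches = text.scan(/([+-]?\d+\.\d+)/).flatten
p matches  # => ["-45.124"]
```

Here 12 is skipped because it has no decimal point, and .04 is skipped because it has no digit before the point.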

Upvotes: 1
