Chris Locke

Reputation: 343

Discrepancy in `scan` and `match` behavior for different Ruby versions

Background

This question concerns the behavior of the String#scan and String#match methods in Ruby. I am using a recursive regular expression that is supposed to match a balanced pair of parentheses. You can see this regular expression, /(\((?:[^\(\)]*\g<0>*)*\))/, in action at https://regex101.com/r/Q1lOC8/1. There it displays the expected behavior: it matches top-level sets of parentheses that contain balanced sets of nested parentheses. Some sample code illustrating the problem follows:

➜  cat test.rb                                                                          
s = "1+(x*(3-4)+5)-1"
r = /(\((?:[^\(\)]*\g<0>*)*\))/
puts s.match(r).inspect
puts s.scan(r).inspect

Problem

I get different results when running the above sample code in ruby-2.3.3 and ruby-2.4.1:

➜  docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.3.3-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:")">
[[")"]]
➜  docker run --rm -v "$PWD":/usr/src/app -w /usr/src/app ruby:2.4.1-alpine ruby test.rb
#<MatchData "(x*(3-4)+5)" 1:"(x*(3-4)+5)">
[["(x*(3-4)+5)"]]

The ruby-2.4.1 result is what I was expecting. In both versions, match correctly matches the same outer set of parentheses, (x*(3-4)+5), but in ruby-2.3.3 the first group is, for some reason, just ")". If I change the regular expression to /(\(.*\))/, the behavior is the same for both versions (same as 2.4.1 above), but it no longer ensures that nested parentheses are balanced.

What is the true expected behavior of match in this case?

Upvotes: 3

Views: 121

Answers (1)

Wiktor Stribiżew

Reputation: 627087

First, I should note that what works at regex101.com does not have to work everywhere: any regex written with the help of an online regex tester must be tested in the target environment. You tested with the PCRE option, and it worked there, but PCRE is a different library from Onigmo, which Ruby uses.

Now, the problem seems to be how the Onigmo regex engine handles recursion in 2.3.3: the \g<0> construct recurses the whole pattern (the 0th group), so the outer capturing parentheses (Group 1) are also repeated (while the group ID stays the same), effectively creating a repeated capturing group. The value of such a group is overwritten at each iteration, and that is why you get ) in the end.
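To see the overwriting effect in isolation, here is a minimal sketch that uses a plain quantified group rather than recursion (the string and pattern are made up for illustration and are not from the question):

```ruby
# Illustrative only: a quantified capturing group keeps just its last iteration.
m = "abc".match(/(\w)+/)

whole_match  = m[0] # the full text consumed by the pattern
last_capture = m[1] # only the final iteration of Group 1 survives

puts whole_match.inspect  # prints "abc"
puts last_capture.inspect # prints "c"
```

The overall match still covers the whole string; only the group value is reduced to the last iteration, which is the same shape as the 2.3.3 result in the question.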

The workaround is to recurse the Group 1 subpattern instead, which keeps the full Group 1 value without overwriting it upon each iteration. This matters because, when a capturing group is defined in the pattern, String#scan only returns the capture(s).

Use

r = /(\((?:[^\(\)]*\g<1>*)*\))/
                      ^
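If you would rather not depend on the Group 1 value at all, one possible alternative is to read the overall match inside a scan block. This is only a sketch: top_level_groups is a hypothetical helper name, and the second sample string is made up. It relies on the overall match being the same in both versions, as the MatchData output in the question shows.

```ruby
# Hypothetical helper (not part of the original answer): collect the overall
# match of each scan hit instead of the capture-group value.
def top_level_groups(str)
  # Same pattern as above, recursing Group 1 via \g<1>.
  pattern = /(\((?:[^\(\)]*\g<1>*)*\))/
  results = []
  # With a block, scan sets the last match for each hit, so
  # Regexp.last_match(0) is the full matched text.
  str.scan(pattern) { results << Regexp.last_match(0) }
  results
end

puts top_level_groups("1+(x*(3-4)+5)-1").inspect # prints ["(x*(3-4)+5)"]
puts top_level_groups("(a)(b(c))").inspect       # prints ["(a)", "(b(c))"]
```

Since index 0 always refers to the whole match, this sidesteps the question of which value a repeated or recursed group ends up holding.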

Upvotes: 1
