Ian Dickinson

Reputation: 13315

Ruby gsub regex unexpected behaviour

I thought I knew regexes pretty well, but this has me puzzled:

irb(main):016:0> source = "/foo/bar"
=> "/foo/bar"
irb(main):017:0> source.gsub( /[^\/]*\Z/, "fubar" )
=> "/foo/fubarfubar"

As far as I can tell, /[^\/]*\Z/ has a unique expansion to match bar and therefore should result in /foo/fubar. I can't see at all why I get fubarfubar as the replacement.

The replacement works if I call sub rather than gsub, so it's not a question of working around the problem but rather uncovering my misunderstanding of gsub.

Upvotes: 2

Views: 266

Answers (2)

Stefan

Reputation: 114268

I don't think this is a bug at all. Regular expressions can and will match zero-width positions.

Therefore, the regex engine sees the string "xox" more like this:

"" "x" "" "o" "" "x" ""

(fun fact: in Ruby, the above actually results in "xox")

If we gsub a single x with a _, everything works as expected:

"xox".gsub(/x/, "_") #=> "_o_"

But if we match x*, things get weird:

"xox".gsub(/x*/, "_") #=> "__o__"

This is because * matches zero or more times:

"" "x" "" "o" "" "x" ""
^^^^^^ ^^     ^^^^^^ ^^

It may be clearer if we reduce "zero or more" to just zero:

"xox".gsub(/x{0}/, "_") #=> "_x_o_x_"

The matches are:

"" "x" "" "o" "" "x" ""
^^     ^^     ^^     ^^

The same happens in your example. Your pattern matches [^\/] zero or more times, so the regex engine first matches bar at the end of the string ([^\/] 3 times) and then the empty string right after it ([^\/] 0 times):

"/" "" "b" "" "a" "" "r" ""
    ^^^^^^^^^^^^^^^^^^^^ ^^
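To see these matches directly, here is a minimal sketch that lists every match together with its start and end offsets. The helper name match_spans is made up for illustration; it relies on String#scan visiting the same successive matches that gsub replaces.

```ruby
# Hypothetical helper: collect [matched_text, start, end] for every
# successive match of pattern in string.
def match_spans(string, pattern)
  spans = []
  string.scan(pattern) do
    m = Regexp.last_match          # MatchData for the current match
    spans << [m[0], m.begin(0), m.end(0)]
  end
  spans
end

source = "/foo/bar"
spans = match_spans(source, /[^\/]*\Z/)
p spans
# => [["bar", 5, 8], ["", 8, 8]]
```

The second entry is the zero-width match at offset 8, the very end of the string, which is why gsub inserts the replacement twice.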

Upvotes: 2

Wiktor Stribiżew

Reputation: 627607

You need to use sub as you only need to replace once at the end of the string:

source.sub( /[^\/]*\Z/, "fubar" )
       ^^^

See the IDEONE demo

The problem is most probably with the way the matches are collected: since your pattern can match an empty string, the empty string at the very end of the input is also treated as a second match. This is not only a Ruby issue; the same behaviour shows up in many other languages.

So, actually, this is what is happening:

  • The [^\/]*\Z pattern matches bar and replaces it with fubar
  • The regex index is now at the end of the string. Only an empty string is left there, but Ruby still treats it as a valid position at which to try a match, and
  • [^\/]*\Z matches that empty string, which adds another fubar.

If you need to use gsub, replace the * quantifier, which allows matching 0 characters, with +, which requires at least 1 occurrence of the quantified subpattern. That way the pattern can no longer match a 0-length string:

source.gsub( /[^\/]+\Z/, "fubar" )
                   ^
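As a rough side-by-side check of the variants discussed in this answer, the sketch below runs each one on the question's input. The variable names and the extra "/foo/" input are only illustrative; that input shows where * and + stop being interchangeable.

```ruby
source = "/foo/bar"

star_gsub = source.gsub(/[^\/]*\Z/, "fubar")  # "bar" plus the empty match at the end
star_sub  = source.sub(/[^\/]*\Z/, "fubar")   # only the first match is replaced
plus_gsub = source.gsub(/[^\/]+\Z/, "fubar")  # + cannot match the empty string

p star_gsub  # => "/foo/fubarfubar"
p star_sub   # => "/foo/fubar"
p plus_gsub  # => "/foo/fubar"

# With a trailing slash there is no final segment to replace:
trailing = "/foo/"
star_trailing = trailing.sub(/[^\/]*\Z/, "fubar")   # empty match at the end is replaced
plus_trailing = trailing.gsub(/[^\/]+\Z/, "fubar")  # no match, string is unchanged

p star_trailing  # => "/foo/fubar"
p plus_trailing  # => "/foo/"
```

So sub with * and gsub with + agree on "/foo/bar", but they differ when the last path segment is empty. Which one is right depends on whether an empty segment should be filled in.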

The rule of thumb: Avoid regexps that match empty strings inside Regex replace methods!

Upvotes: 5
