Andrew Anderson

Reputation: 441

Ruby regex too greedy with back to back matches

I'm working on some text processing in Ruby 1.8.7 to support some custom shortcodes that I've created. Here are some examples of my shortcode:

[CODE first-part]
[CODE first-part second-part]

I'm using the following regex to grab the parts:

text.gsub!( /\[CODE (\S+)\s?(\S?)\]/i, replacementText )

The problem is this: the regex doesn't work on the following text:

[CODE first-part][CODE first-part-again]

The results are as follows:

1.  first-part][CODE
2.  first-part-again

It seems that the \s? is the problematic part of the regex that is searching on until it hits the last space, not the first one. When I change the regex to the following:

/\[CODE ([\w-]+)\s?(\S*)\]/i

It works fine. My only concern is what \w matches compared with \S, as I want to make sure \w will match URL-safe characters.

I'm sure there's a perfectly valid explanation, but it's eluding me. Any ideas? Thanks!

Upvotes: 1

Views: 176

Answers (2)

Martin Ender

Reputation: 44259

Actually, thinking about it, just using [^\]] might not be enough, as it will swallow up all spaces as well. You also need to exclude those:

/\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i

Note the [ ] - I just think it makes literal spaces more readable.


Explained in free-spacing mode:

\[CODE[ ]    # match your identifier   
(            # capturing group 1 
  [^\]\s]+   # match one or more non-], non-whitespace characters
)            # end of group 1
\s?          # match an optional whitespace character
(            # capturing group 2 
  [^\]\s]*   # match zero or more non-], non-whitespace characters
)            # end of group 2
\]           # match the closing ]

As none of the character classes in the pattern includes ], you can never possibly go beyond the end of the square bracketed expression.
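
A minimal sketch of that pattern applied to the back-to-back text from the question, using scan to list the captures and gsub with a block to build a replacement. The "<...>" output format is made up for illustration, since the question does not show what replacementText contains.

```ruby
# Pattern from this answer: neither capture group can match "]" or whitespace.
shortcode = /\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i

text = "[CODE first-part][CODE first-part-again] [CODE first-part second-part]"

# scan returns one [group1, group2] pair per shortcode found.
pairs = text.scan(shortcode)

# Hypothetical replacement: the "<...>" format is an assumption, not from the question.
replaced = text.gsub(shortcode) do
  first, second = $1, $2
  second.empty? ? "<#{first}>" : "<#{first}|#{second}>"
end

puts pairs.inspect
# [["first-part", ""], ["first-part-again", ""], ["first-part", "second-part"]]
puts replaced
# <first-part><first-part-again> <first-part|second-part>
```

When the second part is missing, group 2 is an empty string rather than nil, so the block checks empty? before using it.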

By the way, if you find unnecessary escapes in regex as obscuring as I do, here is the minimal version:

/\[CODE[ ]([^]\s]+)\s?([^]\s]*)]/i

But that is definitely a matter of taste.

Upvotes: 2

Neil Slater

Reputation: 27207

The problem was with the greedy \S+ in this:

/\[CODE (\S+)\s?(\S?)\]/i

You could try:

/\[CODE (\S+?)\s?(\S?)\]/i

but actually your new character class is IMO superior.
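
On the \w concern: for ASCII text, \w matches letters, digits and the underscore, so [\w-] covers hyphens and underscores but not "." or "~", which are also commonly treated as URL-safe. A quick check, using a made-up sample string:

```ruby
# Sample string is made up; it contains "-", "_", "." and "~".
sample = "a-b_c.d~e"

# [\w-] accepts letters, digits, "_" and "-", so "." and "~" split the runs.
narrow = sample.scan(/[\w-]+/)

# Adding "." and "~" to the class keeps the whole string together.
wide = sample.scan(/[\w.~-]+/)

puts narrow.inspect  # ["a-b_c", "d", "e"]
puts wide.inspect    # ["a-b_c.d~e"]
```

Whether the wider class is needed depends on which characters the shortcode parts are actually allowed to contain.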

Even better might be:

/\[CODE ([^\]]+?)\s?([^\]]*)\]/i
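
A small comparison of where each variant stops on the back-to-back text. It assumes the second group in the question's pattern was meant to be \S* (as in the asker's revised pattern), since that is the form that reproduces the reported output. The helper name matches is made up for this sketch.

```ruby
text = "[CODE first-part][CODE first-part-again]"

# Hypothetical helper: collect the full text of every match of a pattern.
def matches(text, pattern)
  found = []
  text.scan(pattern) { found << $~[0] }
  found
end

greedy  = /\[CODE (\S+)\s?(\S*)\]/i          # assumed form of the original
lazy    = /\[CODE (\S+?)\s?(\S*)\]/i         # lazy first group
negated = /\[CODE ([^\]]+?)\s?([^\]]*)\]/i   # groups cannot cross a "]"

puts matches(text, greedy).inspect
# ["[CODE first-part][CODE first-part-again]"]
puts matches(text, lazy).inspect
# ["[CODE first-part]", "[CODE first-part-again]"]
puts matches(text, negated).inspect
# ["[CODE first-part]", "[CODE first-part-again]"]
```

Both lazy variants end each match at the first ], but because \s? is optional the lazy first group can shrink: with these patterns group 1 captures only "f" and group 2 takes the rest. It is worth checking the captures as well as the overall match.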

Upvotes: 1
