Sawyer
Sawyer

Reputation: 15917

Regex greedy parse direction

I found there're two different opinions about how greedy regex is executed:

Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from.

also see this article: Performance of Greedy vs. Lazy Regex Quantifiers

Repetition with Star and Plus the Looking Inside The Regex Engine section talk about <.+>:

The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first < in the string.

I want to know which one is correct? This matters because it will affect the efficiency of regex. I added various language tags, because I want to know if it's implemented differently in each language.

Upvotes: 3

Views: 829

Answers (4)

ikegami
ikegami

Reputation: 386206

"The matcher backs off the input string by one character and tries again" is just describing backtracking, so "then it'll backtrack" is therefore saying the same thing. Since both of your quotes regarding greediness say the same thing, both are correct. (Your third quote has nothing to do with greediness.)


Let's provide an example.

'xxabbbbbxxabbbbbbbbb' =~ /([ab]*)bb/;
  1. Try at pos 0.
    1. [ab]* matches 0 chars "".
      1. At pos 0, bb fails to match ⇒ Backtrack.
    2. [ab]* can't match any less ⇒ Backtrack.
  2. Try at pos 1.
    1. [ab]* matches 0 chars "".
      1. At pos 1, bb fails to match ⇒ Backtrack.
    2. [ab]* can't match any less ⇒ Backtrack.
  3. Try at pos 2.
    1. [ab]* matches 6 chars "abbbbb".
      1. At pos 8, bb fails to match ⇒ Backtrack.
    2. [ab]* matches 5 chars "abbbb". (Back off one)
      1. At pos 7, bb fails to match ⇒ Backtrack.
    3. [ab]* matches 4 chars "abbb". (Back off one)
      1. At pos 6, bb matches.
        1. Success.

So $1 is "abbb". (Not abbbbbbb. "Greedy" doesn't mean "longest match possible".)


Now, let's see what happens if we make the "*" non-greedy.

'xxabbbbbxxabbbbbbbbb' =~ /([ab]*?)bb/;
  1. Try at pos 0.
    1. [ab]*? matches 0 chars "".
      1. At pos 0, bb fails to match ⇒ Backtrack.
    2. [ab]* can't match any more ⇒ Backtrack.
  2. Try at pos 1.
    1. [ab]*? matches 0 chars "".
      1. At pos 1, bb fails to match ⇒ Backtrack.
    2. [ab]* can't match any more ⇒ Backtrack.
  3. Try at pos 2.
    1. [ab]*? matches 0 chars "".
      1. At pos 2, bb fails to match ⇒ Backtrack.
    2. [ab]*? matches 1 char "a". (Add one)
      1. At pos 3, bb matches.
        1. Success.

So $1 is "a".


Specific implementation might do things differently as an optimisation, as long as it gives the same result as presented here. You can see Perl at work using

perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*)bb/;'
perl -Mre=debug -E'say "xxabbbbbxxabbbbbbbbb" =~ /([ab]*?)bb/;'

Upvotes: 0

hobbs
hobbs

Reputation: 240139

There's not a difference, and there's no "matching from the back" implied. When the first reference says "the matcher", it means the repeated regex atom, not the entire expression — i.e. when matching /ab+/ against "abbot", we don't know a priori that the b+ won't match the entire rest of the string. I think that the Java tutorial might be discussing the behavior of regex matching against streams, and noting that a greedy repetition will eagerly consume the rest of the stream, while a nongreedy repetition will lazily consume the stream as needed.

Upvotes: 0

Ωmega
Ωmega

Reputation: 43683

Implementation of regex in differenet envronments and programming languages does not have to be same, so there is no exact answer to your question.

If you go with Perl, then you should know that regex implementation is very optimized, as this is one of the strongest feature of this programming language. So, don't worry about efficiency :)

Upvotes: 0

David B
David B

Reputation: 2698

Assuming they're functionally equivalent (and based on my Java regex use, they are), it's just a difference in engine implementation. Regular Expressions are not implemented exactly the same way in all languages, and can be more or less powerful based on which language you use.

The second link describes Perl, so I'd trust Oracle on the Java side of things.

Both attempt to get the largest match possible.

Making a quantifier lazy by adding the ? key will attempt the smallest match possible.

Upvotes: 3

Related Questions