kos

Reputation: 43

Regular expression with named subpattern doesn't see the best match

When I use a defined subpattern inside a regular expression, the engine doesn't choose the best match but stops at the first one. Did I forget some flag?

Regular Expression: (?<minutes>[0-9]|[1-5][0-9]):(?&minutes); Testing string: 47:24;.

Expression doesn't match:

pic 1 (47:24;)

But string 47:2; is matched correctly:

pic 2 (47:2;).

If I change 'or' condition to [1-5][0-9]|[0-9], reg exp (?<minutes>[1-5][0-9]|[0-9]):(?&minutes); works just fine. Is there any other way to make string '47:24;' match without reversing the 'or' condition?

Upvotes: 4

Views: 567

Answers (2)

Casimir et Hippolyte

Reputation: 89629

With PCRE, recursive groups are atomic (see this article). That is why the regex engine can't backtrack in (?&minutes).

In 47:24;, the 2 of 24 is matched by the first branch [0-9] (since the first branch that succeeds wins). When the pattern then fails, because the next character is a 4 and not a ;, the regex engine can't backtrack inside the (?&minutes) subpattern to try the second branch [1-5][0-9]. (You can take a look at the debugger.)
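
The effect can be reproduced outside PCRE. Below is a minimal sketch in Python. Python's re module has no (?&name) subroutine calls, so the atomic call is imitated with the lookahead-plus-backreference idiom (?=(...))\1. That substitution and the names BRANCHES, backtracking, atomic_like and matches are assumptions made for illustration, not a description of PCRE internals.

```python
import re

BRANCHES = r"[0-9]|[1-5][0-9]"  # short branch first, as in the question

# Ordinary group written out twice: the engine may backtrack into
# either alternation when the rest of the pattern fails.
backtracking = re.compile(rf"(?:{BRANCHES}):(?:{BRANCHES});")

# Second occurrence made atomic-like (assumed stand-in for the PCRE call):
# the lookahead keeps only its first successful match, and \1 then
# consumes exactly that text, with no way back into the alternation.
atomic_like = re.compile(rf"(?:{BRANCHES}):(?=({BRANCHES}))\1;")


def matches(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None
```

Under these assumptions, matches(backtracking, "47:24;") is true, matches(atomic_like, "47:24;") is false, and matches(atomic_like, "47:2;") is true, which mirrors the behaviour described in the question.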

Solution: don't use recursion for such a small subpattern; it's useless and makes no sense (in particular if you use names for capture groups). Write something like:

(?<minutes>[1-5]?[0-9]):(?<seconds>[1-5]?[0-9]);

or why not:

(?(DEFINE)(?<sex>[1-5]?[0-9]))
(?<minutes>(?&sex)):(?<seconds>(?&sex));

("sex" stands for "sexagesimal", not for what you think.)

This seems redundant, but it makes sense and is useful if you want to extract minutes and seconds (otherwise, don't use groups at all). After all, if you use named captures, your goal is not to write the shortest pattern in the world.
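
As a rough illustration of extracting both values with two named groups, here is a sketch in Python. It uses Python's (?P<name>...) spelling for named groups; the names TIME and parse are hypothetical helpers, not part of the answer.

```python
import re

# Same pattern as above, with Python's (?P<name>...) syntax for named groups.
TIME = re.compile(r"(?P<minutes>[1-5]?[0-9]):(?P<seconds>[1-5]?[0-9]);")


def parse(text):
    """Return (minutes, seconds) as integers, or None if text does not match."""
    m = TIME.fullmatch(text)
    if m is None:
        return None
    return int(m.group("minutes")), int(m.group("seconds"))
```

With this sketch, parse("47:24;") gives (47, 24) and parse("47:2;") gives (47, 2), because the optional [1-5]? can give its digit back when only one digit is present.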

If you can't avoid an alternation:

  • You can put the longest branch first: [1-5][0-9]|[0-9] as suggested by Lucas.
  • You can also use mutually exclusive branches: [1-5][0-9]?|[06-9] or [06-9]|[1-5][0-9]? (in this case the order doesn't matter).
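
The two options above can be checked with a small sketch in Python. Since Python's re module has no (?&name) calls, the atomic call is imitated with (?=(...))\1; that stand-in and the helper names atomic_like and accepts_all are assumptions for illustration only.

```python
import re


def atomic_like(branches):
    """Build 'group:call;' where the call cannot be backtracked into.

    The first occurrence is an ordinary group; the second uses the
    lookahead-plus-backreference idiom as an assumed stand-in for an
    atomic subroutine call.
    """
    return re.compile(rf"(?:{branches}):(?=({branches}))\1;")


def accepts_all(branches):
    """Return True if every m:s; with m and s in 0..59 matches."""
    pattern = atomic_like(branches)
    return all(
        pattern.fullmatch(f"{m}:{s};") is not None
        for m in range(60)
        for s in range(60)
    )
```

In this sketch, accepts_all is true for [1-5][0-9]|[0-9] and for both orders of the mutually exclusive branches, and false for the original [0-9]|[1-5][0-9].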

Note that this behaviour of recursive groups is particular to PCRE; it is different in Perl or Ruby.

Upvotes: 2

Lucas Trzesniewski

Reputation: 51430

Patterns are matched left to right, and alternatives are tried left to right too. That's the way NFA regex engines work. PCRE also has a DFA engine, which will try to find the longest match, but it's not exposed to PHP.

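So if you have a pattern like a|b and a can match at the start of whatever b matches, the engine will try a first and succeed. The b branch is only tried if the rest of the pattern fails and the engine is able to backtrack into the alternation.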
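
A short sketch of this left-to-right behaviour, written in Python, whose re module is also a backtracking engine (first_match is a hypothetical helper name):

```python
import re


def first_match(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = re.match(pattern, text)
    return m.group() if m else None
```

Here first_match(r"[0-9]|[1-5][0-9]", "24") returns "2", because the first branch already succeeds, while first_match(r"[1-5][0-9]|[0-9]", "24") returns "24". Adding something after the alternation, as in first_match(r"(?:[0-9]|[1-5][0-9]);", "24;"), gives "24;", since ordinary backtracking then reaches the second branch.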

You could write \b(?:[1-5][0-9]|[0-9])\b but it seems redundant.

Just use \b[1-5]?[0-9]\b (as stribizhev suggested) to get it right all the time. \b is a word boundary: it ensures you match a whole number, and not just a few digits of a larger number.
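
To see what the word boundaries buy you, here is a small sketch in Python (NUMBER and numbers_in are hypothetical names used only for this example):

```python
import re

# Whole numbers from 0 to 59, not digits taken from inside a longer number.
NUMBER = re.compile(r"\b[1-5]?[0-9]\b")


def numbers_in(text):
    """Return every standalone 0-59 number found in text, as strings."""
    return NUMBER.findall(text)
```

For example, numbers_in("7 42 123 60 59") returns ["7", "42", "59"]: neither 123 nor 60 contributes a partial match, because \b refuses to split a run of digits.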

Upvotes: 3
