Reputation: 43
When I use a defined subpattern inside a regular expression, it doesn't choose the best match but stops at the first match. Did I forget some flag?
Regular Expression: (?<minutes>[0-9]|[1-5][0-9]):(?&minutes);
Test string: 47:24;
The expression doesn't match, but the string 47:2; is matched correctly.
If I change the order of the alternation to [1-5][0-9]|[0-9], the regular expression (?<minutes>[1-5][0-9]|[0-9]):(?&minutes); works just fine. Is there any other way to make the string '47:24;' match without reversing the alternation?
Upvotes: 4
Views: 567
Reputation: 89629
With PCRE, recursive groups are atomic (see this article). That is why the regex engine can't backtrack into (?&minutes).
In 47:24;, the 2 of 24 is matched by the first branch [0-9] (the first branch that succeeds wins). When the pattern then fails, because the next character in the string is a 4 and not a ;, the regex engine can't backtrack into the (?&minutes) subpattern to try the second branch [1-5][0-9]. (You can take a look at the debugger.)
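To see the effect outside PCRE, here is a rough Python sketch. Python's built-in re module has no (?&name) subroutine calls, so this only emulates an atomic group with the lookahead-plus-backreference idiom (?=(...))\1; it is not PCRE itself. The names BACKTRACKING, ATOMIC and the group name sec are made up for the illustration.

```python
import re

# Ordinary group: the engine may backtrack into the alternation.
BACKTRACKING = re.compile(
    r"(?P<minutes>[0-9]|[1-5][0-9]):(?:[0-9]|[1-5][0-9]);"
)

# Emulated atomic group (assumption: a stand-in for the atomic (?&minutes) call).
# A lookahead is not re-entered once it has succeeded, so whatever the first
# matching branch captured is final; the backreference then consumes it.
ATOMIC = re.compile(
    r"(?P<minutes>[0-9]|[1-5][0-9]):(?=(?P<sec>[0-9]|[1-5][0-9]))(?P=sec);"
)

for text in ("47:24;", "47:2;"):
    print(text, bool(BACKTRACKING.fullmatch(text)), bool(ATOMIC.fullmatch(text)))
# 47:24; True False
# 47:2; True True
```

The second pattern fails on 47:24; for the same reason as described above: [0-9] takes the 2, the ; does not match the 4, and the second branch is never tried.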
Solution: don't use recursion for such a small subpattern; it's useless and makes no sense (in particular if you use names for capture groups). Writing something like:
(?<minutes>[1-5]?[0-9]):(?<seconds>[1-5]?[0-9]);
or why not (sex stands for "sexagesimal", not for what you think):
(?(DEFINE)(?<sex>[1-5]?[0-9]))
(?<minutes>(?&sex)):(?<seconds>(?&sex));
seems redundant, but makes sense and is useful if you want to extract minutes and seconds (otherwise, don't use groups at all). After all, if you use named captures, your goal is not to write the shortest pattern in the world.
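As an illustration of why the two named groups are worth the redundancy, here is a small Python sketch that extracts both values. It assumes Python's spelling of named groups, (?P<name>...), and the helper name parse is hypothetical.

```python
import re

# Same idea as the two-group pattern, written with Python's (?P<name>...) syntax.
TIME = re.compile(r"(?P<minutes>[1-5]?[0-9]):(?P<seconds>[1-5]?[0-9]);")

def parse(text):
    """Return (minutes, seconds) as ints, or None if text is not 'M:S;'."""
    m = TIME.fullmatch(text)
    if m is None:
        return None
    return int(m.group("minutes")), int(m.group("seconds"))

print(parse("47:24;"))  # (47, 24)
print(parse("47:2;"))   # (47, 2)
print(parse("61:05;"))  # None, 61 is out of range
```

Because each number is a single [1-5]?[0-9] with no alternation, there is no branch order to get wrong.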
If you can't avoid an alternation:
[1-5][0-9]|[0-9], as suggested by Lucas.
[1-5][0-9]?|[06-9] or [06-9]|[1-5][0-9]? (the two branches can't start with the same digit, so in this case the order doesn't matter).
Note that this behaviour of recursive groups is particular to PCRE; it is different in Perl or Ruby.
Upvotes: 2
Reputation: 51430
Patterns are matched left to right, and alternatives are tried left to right too. That's the way NFA regex engines work. PCRE also has a DFA engine which will try to find the longest match, but it's not exposed to PHP.
So if you have a pattern like a|b and b is a subset of a, the engine will try a first and succeed. The b part will never be matched.
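A quick way to check the left-to-right rule is to swap the branches and compare. This is a minimal sketch using Python's re module, which is also a backtracking engine; the variable names are only for illustration.

```python
import re

text = "47"

# [0-9] is tried first and succeeds, so only one digit is consumed.
first_short = re.match(r"[0-9]|[1-5][0-9]", text).group()

# [1-5][0-9] is tried first, so both digits are consumed.
first_long = re.match(r"[1-5][0-9]|[0-9]", text).group()

print(first_short, first_long)  # 4 47
```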
You could write \b(?:[1-5][0-9]|[0-9])\b but it seems redundant.
Just use \b[1-5]?[0-9]\b (as stribizhev suggested) to get it right all the time. \b is a word boundary; it ensures you match a whole number, and not just a few digits of a larger number.
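The effect of the word boundaries can be checked with a short Python sketch; the sample string and the names NUMBER and found are made up for the example.

```python
import re

NUMBER = re.compile(r"\b[1-5]?[0-9]\b")

# 123 and 60 are rejected as a whole: \b prevents matching just some of their digits.
found = NUMBER.findall("7 42 123 60 59")
print(found)  # ['7', '42', '59']
```

Without the \b anchors, the same pattern would pick up pieces such as 12 and 3 out of 123.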
Upvotes: 3