Reputation: 1364
I'm working with pattern matching in PostgreSQL 9.4. I run this query:
select regexp_matches('aaabbb', 'a+b+?')
and I expect it to return 'aaab', but instead it returns 'aaabbb'. Shouldn't the b+? atom match only one 'b' since it is not greedy? Is the greediness of the first quantifier setting the greediness for the whole regular expression?
Upvotes: 2
Views: 215
Reputation: 13640
Here is what I've found in the PostgreSQL 9.4 documentation:
Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
and
If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The sub-expression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.
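This means the greediness of the RE as a whole is set by its first quantified atom, and that decides the overall match length; later quantifiers cannot change it. In your case a+ is greedy, so the whole RE is greedy and takes the longest possible match, 'aaabbb', even though b+? is non-greedy.
I guess you have to use a+?b+? to achieve what you want.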
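As a quick check, here is a minimal sketch comparing the original pattern with a+?b+?, assuming a PostgreSQL session. The first result is the one reported in the question; the second is what the documented rules quoted above predict.

```sql
-- The first quantified atom (a+) is greedy, so the RE as a whole is greedy:
-- the longest match starting at the first 'a' is taken.
SELECT regexp_matches('aaabbb', 'a+b+?');   -- {aaabbb}, as reported in the question

-- The first quantified atom (a+?) is non-greedy, so the RE as a whole is non-greedy:
-- the shortest match starting at the first 'a' is taken.
SELECT regexp_matches('aaabbb', 'a+?b+?');  -- expected: {aaab}
```

Note that a+? still ends up matching all three a's: the match has to start at the first 'a' and reach a 'b', so the shortest possible overall match is 'aaab'.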
Upvotes: 3