Reputation: 1531
This question is specific to Java 7/8.
A fairly complex regex that uses quantifiers are prohibited inside lookbehind assertions such as this one:
(?<=(a|b*)*)bc
as it results in an runtime exception with a message like:
look-behind group does not have obvious maximum length error
I am guessing it is because quantifiers such as *
and +
are "generally" not allowed.
However, the following does work:
(?<=a*)bc
Why is this so?
There are similar posts about this issue on SO:
Some posts are specific to Java but don't seem to provide concrete answers or references such as this one. Mostly the answers state these quantifiers simply cannot be used. Also, regular-expressions website states the same.
This post states Java has bugs in it's look behind implementation.
However, the example that I have shown above that uses the zero or many quantifier *
inside the lookbehind is valid for Java 7/8.
Any references or explanation would be helpful.
Upvotes: 2
Views: 2641
Reputation: 31699
After looking at the Pattern
code and trying to trace it, I'm convinced that this is just a bug. Both examples should result in an exception. But the logic that tests for this is incorrect.
This code appears in a couple places:
temp = info.maxLength * cmax + maxL;
info.maxLength = temp;
if (temp < maxL) {
info.maxValid = false;
}
Note that as long as maxLength
and cmax
are nonnegative, temp
should never be less than maxL
unless overflow has occurred. maxValid
is eventually used by the lookbehind code; if maxValid
is false
, "look-behind group does not have obvious maximum length error"
is thrown.
From what I can tell, in a regex like <prefix> <expression>{m,n}
, in the above code, info.maxLength
is the maximum length of "expression", cmax
is the upper bound of the quantifier, and maxL
is the maximum length of "prefix". When a quantifier is *
or +
, the upper bound is set to Integer.MAX_VALUE
. (All the variables here are int
.) This means that there will be overflow unless info.maxLength
is 1 and maxL
is 0. That's exactly the case with
(?<=a*)bc
since the pattern with the quantifier has length 1, and there is nothing before the a*
which means maxL
will be 0. That's why this case falls through the cracks.
For any other values, the computation will overflow, but that doesn't necessarily mean temp < maxL
will be true. If info.maxLength
is even, then an exception will be thrown; but if info.maxLength
is odd, the pattern will compile if maxL
is small enough. This is because of the way wraparound works, mathematically; the code that attempts to check for overflow is quite faulty. This means that
(?<=a*)bc // succeeds
(?<=(ab)*)bc // throws exception
(?<=(abc)*)bc // succeeds
(?<=(abcd)*)bc // throws exception
(?<=(abcde)*)bc // succeeds
Also:
(?<=xa*)bc // throws exception
(?<=x(abc)*)bc // succeeds
Note: It should be noted that in your example regex, lookbehind is useless:
(?<=a*)bc
The lookbehind says to test whether the current location is preceded by zero or more occurrences of the letter a
. This is always true, trivially. So the lookbehind serves no purpose here. Similarly,
(?<=a+)bc
is equivalent to
(?<=a)bc
since as long as there's one a
preceding the current location, it doesn't matter how many more there might be. The same is true of all the examples that don't throw an exception, except for this one:
(?<=x(abc)*)bc // succeeds
since here the matcher does have to go backwards looking for abc
's in the string and making sure the last one is preceded by x
. It seems that Pattern
should throw an exception in this case, but because of the faulty overflow-check logic, it isn't. Therefore, I'm uncertain whether it would actually return the correct result, or whether it might crash or go into an infinite loop. However, I haven't tried it.
Really, the code should just check directly whether cmax
is equal to Integer.MAX_VALUE
, instead of using it in computation and hoping the code can tell later that the result is bogus.
Upvotes: 1