Reputation: 77474
I have tried the same expression in Python and it seems to be okay here, while Java fails with a stack overflow.
This is the simplified test case to demonstrate the problem:
// 10k whitespace.
char[] buf = new char[10000];
Arrays.fill(buf, ' ');
String post = new String(buf);
// All whitespace - works
System.out.println(Pattern.compile(" +").matcher(post).matches());
// All whitespace, or whitespace - Stack Overflow
System.out.println(Pattern.compile("(?: | )+").matcher(post).matches());
The first regular expression, " +"
works fine. The second one, "( | )+"
- which obviously should also match this string, however causes a stack overflow.
I guess this is a limitation due to the way regular expressions (in particular: alternatives) are implemented in Java... state machine based regexp matchers seem to be okay (they will stay in an accepting state); or does pythons regexp engine just have a much larger stack?
Disabling backtracking via atomic groups also works: "(?> | )+"
- I guess in this case, Java will no longer add 6 stack frames on each match (and apparently, the stack cannot fit 60000 frames by default).
This is not just a theoretical example. Consider e.g. "(apple|banana)+"
:
StringBuilder buf = new StringBuilder();
for (int i = 0; i < 10000; i++)
buf.append(Math.random() < .5 ? "apple" : "banana");
String s = buf.toString();
System.out.println(s.replaceAll("(apple|banana)+", "lots of fruits"));
With "(?>apple|banana)+"
it will print as desired lots of fruits
; without the backtracking prevention it will cause a stack overflow.
Yes, I know this is a form of catastrophic backtracking... What surprises me is that Java fails early, where Python still hums along happily, fruitfully removing undesired text... is python smarter, and recognizes this can be processed "greedy" without backtracking? Or does it just make better use of memory?
Upvotes: 4
Views: 271
Reputation: 44376
( | )+
is a pretty basic example of possible catastrophic backtracking. In this case, it's probably better called catastrophic branching, as no backtracking is involved.
It appears as if Java may implement branching using recursion, although I'm not sure if 10000 levels deep is deep enough to trigger an overflow. It's also possible that it's trying to go 2^10000 levels deep, but as I know nothing about the internals, this is pure speculation. Update: You said that 1000 wasn't enough to overflow, so it definitely looks linear and not exponential.
Why don't you try shortening your string, and see exactly how long it has to be before you get the overflow? Also, try something like (apple|banana)+
, and see if it suffers from the same problem. Update: The problem appears to occur for any branching scenario. It's definitely a weakness in the Java regex engine, though I couldn't tell you exactly why. In addition to Python, I can confirm it works fine in .NET and JavaScript (my browser, anyway).
I don't see this being a problem in an NFA-driven engine. I'm assuming that Java uses another approach, but I couldn't find anywhere that this is documented.
Upvotes: 5