Amr Bekhit
Amr Bekhit

Reputation: 4813

Simplifying a regular expression that causes a Java StackOverflowException

I'm trying to extract the following items from a C file:

I've written the following regex to try and find those items:

/\*(?:.|[\r\n])*?\*/|"(?:[^"\\\r\n]|\\.)*"|//.*|\b\d+\b|\b0[xX][\da-fA-F]+\b

The expression is composed of five parts ORed together.

Although the expression seems to work fine when tested using regexpal and a 500 line C file, my Java program throws a StackOverflowException after a few matches.

Here is the Java code that uses the regex:

Pattern itemsOfInterestPattern = Pattern.compile(
        "/\\*(?:.|[\\r\\n])*?\\*/|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"|//.*|\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b");
// Now, go through the source file and process any tags we find
Matcher itemsOfInterestMatcher = itemsOfInterestPattern.matcher(sourceFile);
int matchNumber = 0;
while (itemsOfInterestMatcher.find()) {
    // We've found a token
    ++matchNumber;
    String token = itemsOfInterestMatcher.group();
    // I then have a switch statement that processes each match depending on its type
}

The stack trace when the overflow occurs can be found at http://pastebin.com/7eL6mVd2

What's causing the stack overflow and how can I change the expression to allow it to work?

Amr

Upvotes: 1

Views: 764

Answers (2)

ruakh
ruakh

Reputation: 183341

Judging from the number of times that java.util.regex.Pattern$LazyLoop.match(...) appears in the stack-trace, I'm betting the problem is the use of the reluctant quantifier *?: first it tries to match nothing, then it backtracks and tries to match one character, then it backtracks and tries to match two characters, and so on. So if you have a long comment, it will have to do a lot of backtracking, which apparently involves recursion. (I don't know if all backtracking involves recursion, or just reluctant-quantifier backtracking; in fact, until now, I didn't even realize that reluctant-quantifier backtracking did.) If you change this part:

/\*(?:.|[\r\n])*?\*/

to this:

/\*(?:[^*]|\*(?!/))*+\*/

(using the possessive quantifier *+ instead — it tries to match as much as it can, and never gives anything back), I think you'll find you can handle long comments much better. So, overall, your string-literal would look like this:

"/\\*(?:[^*]|\\*(?!/))*+\\*/|\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"|//.*|\\b\\d+\\b|\\b0[xX][\\da-fA-F]+\\b"

Edited to add (July '13): Someone at my company had a similar issue recently, which led me to look a bit deeper into the cause. What I found is, the problem isn't the backtracking alone, but the combination of the backtracking with the subgroup; for example, a* or a*? would not have this problem, but (a)* or (a)*? or (?:a)* or (?:a)*? would. Above, I suggested disabling backtracking, by using *+ instead of *? (and making the necessary changes to the subexpression); but another approach would have been to eliminate the subexpression, by changing this:

/\*(?:.|[\r\n])*?\*/

to this:

/\*(?s:.*?)\*/

(where the (?s:...) notation is equivalent to ..., except that it locally turns on MULTILINE mode, meaning that . will match any character, even \n). The .*? doesn't require recursion in order to enable backtracking.

That said, I think the *+ approach is better in this case, and perhaps in most cases, since its algorithmic time complexity is lower. (.*? requires constantly trying to match and re-match the rest of the pattern; it can perform arbitrary backtracking without overflowing the stack, but it can take an inordinate amount of time to do so.)

Upvotes: 2

aretai
aretai

Reputation: 1641

Given the error (we don't see the code:() Important tip is that the actual answer is usually hidden somewhere in the first ten lines of the stack trace. You need to read it several times and then check the code. It seems that most of the errors are to do with regex except first two:

at java.lang.AbstractStringBuilder.charAt(AbstractStringBuilder.java:173)
at java.lang.StringBuilder.charAt(StringBuilder.java:55)

IMHO you should check these two lines (maybe with a debugger). Other posts say that such errors may be thrown when you run out of memory - How to validate big xml against xsd schema?. Try smaller file with less comments to start with and check if this error still occurs.

Upvotes: 0

Related Questions