cristi.calugaru

Reputation: 601

Avoid Java regex catastrophic backtracking

I have a regular expression, read from an XML file, that is used by two different tools: one written in Java and one in C++.

[…!\?\.](\)|\]|“|'|"|’|”|‘|´|''|»)*

Trying to match the following string:

!!!!''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''??

The input data comes from some "big data" stored on HDFS.

In Java, the match keeps backtracking forever, while the C++ version handles it fine. The problem is that I cannot change the regular expression, since it is also used by other external modules, and a change is hard to justify because it works fine from C++.

Is there a way to avoid this issue without changing the regex? I tried appending a "$" to it, with no luck.

Upvotes: 2

Views: 1501

Answers (1)

cristi.calugaru

Reputation: 601

The problem was that the alternation contained both ' (one apostrophe) and '' (two apostrophes). Because the group is repeated with ()*, a run of apostrophes can be split into single and double apostrophes in a huge number of ways, and Java tries them one after another when the rest of the match fails. The simple fix is to remove the extra |'' (two apostrophes) alternative: the group already matches a single apostrophe (|') and is repeated zero or more times, so two apostrophes are matched anyway. This makes no difference to the logic of the regex, but it fixes the problem. Thanks for all your input.
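If the XML itself really cannot be touched, one option might be to apply the same fix only in the Java tool, right after the pattern is loaded. The following is a minimal sketch under that assumption, not the code I actually run: the class name and the helpers withoutDoubleApostrophe and firstMatch are made up for illustration, and the non-ASCII characters of the pattern are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApostropheFixSketch {
    // The pattern from the question, as it would arrive from the XML file.
    // Escapes: 2026 ellipsis, 201C/201D double quotes, 2018/2019 single
    // quotes, 00B4 acute accent, 00BB right guillemet.
    static final String ORIGINAL =
        "[\u2026!\\?\\.](\\)|\\]|\u201C|'|\"|\u2019|\u201D|\u2018|\u00B4|''|\u00BB)*";

    // Hypothetical helper: drop the redundant two-apostrophe alternative.
    // The single-apostrophe alternative inside (...)* still covers it.
    static String withoutDoubleApostrophe(String regex) {
        return regex.replace("|''", "");
    }

    // Returns the first match found in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String fixed = withoutDoubleApostrophe(ORIGINAL);
        String sample = "Wow!''') and more";
        // Both patterns should report the same match on a short input.
        System.out.println(firstMatch(ORIGINAL, sample));
        System.out.println(firstMatch(fixed, sample));
    }
}
```

On the short sample, both patterns return the same match, !'''). The string replacement is deliberately narrow: it only removes the literal |'' alternative, so it would need revisiting if the pattern in the XML ever changes shape.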

Upvotes: 1
