Reputation: 126
I have an regex expression to find links in texts:
(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^'\"\\n\\r()<>]+|\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\))+(?:\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\)|[^'\"\\n\\r`!()\\[\\]{};:'.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))
But some (
in a link is causing an thread lock. Searching the Internet I've found some website suggesting that's a Catastrophic Backtracking problem. I've spent some time to optimize the pattern but it does not work. Any ideas?
Example input link that is causing the problem:
https://subdomain.domain.com/web/?id=-%c3%a1(%c2%81y%e2%80%9a%c3%a5d%e2%80%ba%c3%a8%c2%a7%c2%be.%c3%a9+%c2%a8
Upvotes: 1
Views: 211
Reputation: 627082
You should keep to the principle: all subsequent adjoining subpatterns cannot match at the same location in the string. If you quantify them with *
or ?
, make sure those obligatory patterns before them do not match the same text. Else, revamp the pattern. Or make use of atomic groupings.
The (?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/)
part is an alternation where both can match at the same location in the string. This cannot be avoided, so use an atomic group to prevent backtracking into the pattern.
Look at [a-z0-9.-]+[.]
, .
is present in the +
quantified character class. Make it more linear, replace with [a-z0-9-]*(?:\.[a-z0-9-]*)*\.
.
The (?:[^'"\n\r()<>]+|\(([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*\))+
part is a buggy pattern: [^'"\n\r()<>]+
is +
quantified, and again, and it leads to situations when the regex engine reduces it to (?:a+)+
, a classical CA scenario. Use atomic groupings if you do not want to re-vamp, although it seems to be a part of a pattern matching balanced parentheses and can be re-written as [^'"\n\r()<>]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^'"\n\r()<>]*)*
.
The ([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*
part is similar to the part above, change (
to (?>
where you quantify the group and the single obligatory pattern inside it.
The fixed pattern is
var pattern = @"(?i)\b((?>https?://|www\d{0,3}\.|[a-z0-9-]*(?:\.[a-z0-9-]*)*\.[a-z]{2,4}/)(?>[^'""\n\r()<>]+|\((?>[^'""\n\r()<>]+|\([^'""\n\r()<>]+\))*\))+(?:\((?>[^'""\n\r()<>]+|(\([^'""\n\r()<>]+\)))*\)|[^]['""\n\r`!(){};:.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))";
See how it fails gracefully here.
Upvotes: 3