Alexandre Beato
Alexandre Beato

Reputation: 126

Solve Catastrophic Backtracking in my regex detecting links?

I have an regex expression to find links in texts:

(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^'\"\\n\\r()<>]+|\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\))+(?:\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\)|[^'\"\\n\\r`!()\\[\\]{};:'.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))

But some ( in a link is causing an thread lock. Searching the Internet I've found some website suggesting that's a Catastrophic Backtracking problem. I've spent some time to optimize the pattern but it does not work. Any ideas?

Example input link that is causing the problem:

https://subdomain.domain.com/web/?id=-%c3%a1(%c2%81y%e2%80%9a%c3%a5d%e2%80%ba%c3%a8%c2%a7%c2%be.%c3%a9+%c2%a8

Upvotes: 1

Views: 211

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

You should keep to the principle: all subsequent adjoining subpatterns cannot match at the same location in the string. If you quantify them with * or ?, make sure those obligatory patterns before them do not match the same text. Else, revamp the pattern. Or make use of atomic groupings.

The (?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/) part is an alternation where both can match at the same location in the string. This cannot be avoided, so use an atomic group to prevent backtracking into the pattern.

Look at [a-z0-9.-]+[.], . is present in the + quantified character class. Make it more linear, replace with [a-z0-9-]*(?:\.[a-z0-9-]*)*\..

The (?:[^'"\n\r()<>]+|\(([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*\))+ part is a buggy pattern: [^'"\n\r()<>]+ is + quantified, and again, and it leads to situations when the regex engine reduces it to (?:a+)+, a classical CA scenario. Use atomic groupings if you do not want to re-vamp, although it seems to be a part of a pattern matching balanced parentheses and can be re-written as [^'"\n\r()<>]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^'"\n\r()<>]*)*.

The ([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))* part is similar to the part above, change ( to (?> where you quantify the group and the single obligatory pattern inside it.

The fixed pattern is

var pattern = @"(?i)\b((?>https?://|www\d{0,3}\.|[a-z0-9-]*(?:\.[a-z0-9-]*)*\.[a-z]{2,4}/)(?>[^'""\n\r()<>]+|\((?>[^'""\n\r()<>]+|\([^'""\n\r()<>]+\))*\))+(?:\((?>[^'""\n\r()<>]+|(\([^'""\n\r()<>]+\)))*\)|[^]['""\n\r`!(){};:.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))";

See how it fails gracefully here.

Upvotes: 3

Related Questions