AnD

Reputation: 21

Regex: Catastrophic backtracking when processing a large string

I need help optimizing my regex for processing the URL BBCode tag. The regex checks that a URL tag has a valid pattern and does NOT contain a whitelisted protocol:

#(\[url=(?:"|"|\'|)(((((?!https|http|ftp|mailto).)*):(//)?)([^\[\]]*))(?:"|"|\'|)\])(.*)(\[/url\])#siU

Regex will ignore :

And match when :

It runs well and has no issues until a user submits a string longer than 10,000 characters, which causes catastrophic backtracking.

Regex101 Reference Link

Upvotes: 2

Views: 191

Answers (1)

wp78de

Reputation: 18980

Here is a slightly optimized version:

(?:\[url=(?:"|"|\'|)(?:(?:(?:(?:(?!https?|ftp|mailto).)*):(?://)?)(?:(?!"|"|&quote;).)++)(?:"|"|\'|)\])(?:(?!\[/url\]).)++(?:\[/url\])

The main optimizations here are:

  • changed most of the capture groups into non-capture groups (?:)
  • changed .* expressions into tempered greedy tokens/excludes (?:(?!).)
  • added some possessive quantifiers ++
  • (switching from protocol blacklist to a whitelist would also help a lot)
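
To illustrate why a tempered greedy token helps, here is a minimal sketch in Python (not PHP, and with simplified patterns rather than the full regex above). The trailing `END` marker and the sample text are assumptions made up for the demonstration. When the rest of the pattern fails, a lazy `.*?` is allowed to expand across a closing tag, while the tempered token can never step over `[/url]`:

```python
import re

# Hypothetical sample: two tags, only the second is followed by "END".
text = "[url=a]x[/url] [url=b]y[/url]END"

# Lazy dot: may run past the first [/url] while searching for "[/url]END".
lazy = re.compile(r"\[url=([^\]]*)\](.*?)\[/url\]END", re.DOTALL)

# Tempered greedy token: each character is consumed only if it does not
# start a closing tag, so the body can never contain "[/url]".
tempered = re.compile(r"\[url=([^\]]*)\]((?:(?!\[/url\]).)*)\[/url\]END", re.DOTALL)

lazy_body = lazy.search(text).group(2)
tempered_body = tempered.search(text).group(2)

print(lazy_body)      # x[/url] [url=b]y
print(tempered_body)  # y
```

Because the tempered token has a hard stop at the closing tag, the engine has fewer alternatives to retry on a failed match, which is what keeps backtracking in check on long inputs.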

Demo

If you are going to use this pattern often, it might be worth mentioning the S (study) PHP regex flag. Judging from its description, it is unlikely to help here, but it might still be worth a try. I have not tested it.

Sample Code


Regarding your updated sample: it's probably best to do this in a two-step process. First, extract the URL meta tags with a much simpler regex, e.g.

\[url=.*\[/url\]

then, use your original regex or the one above to verify the input format.
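
The two-step idea could be sketched roughly as follows. This is an illustration in Python rather than PHP, and it swaps the second step for a plain whitelist lookup instead of re-running the large regex. The names `find_disallowed_urls`, `TAG_RE` and `SCHEME_RE` are hypothetical, and the rule "flag a tag only if it has a scheme that is not whitelisted" is my reading of the question, not something it states outright:

```python
import re

# Whitelisted protocols, taken from the pattern in the question.
ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto"}

# Step 1: a deliberately simple extractor. [^\]]* cannot run past the
# end of the opening tag, and the body is matched lazily up to [/url].
TAG_RE = re.compile(r"\[url=([^\]]*)\](.*?)\[/url\]", re.IGNORECASE | re.DOTALL)

# Step 2: pull the scheme (the part before the first colon) off the target.
SCHEME_RE = re.compile(r"\s*([a-z][a-z0-9+.\-]*):", re.IGNORECASE)


def find_disallowed_urls(text):
    """Return the targets of [url=...] tags whose scheme is not whitelisted."""
    bad = []
    for tag in TAG_RE.finditer(text):
        # Drop surrounding whitespace and optional quotes around the target.
        target = tag.group(1).strip().strip("\"'")
        scheme = SCHEME_RE.match(target)
        if scheme and scheme.group(1).lower() not in ALLOWED_SCHEMES:
            bad.append(target)
    return bad


sample = '[url=http://a.com]ok[/url] [url="javascript:alert(1)"]x[/url]'
print(find_disallowed_urls(sample))  # ['javascript:alert(1)']
```

Each tag is validated on its own short substring, so the expensive check never sees the whole 10,000-character input at once.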

Upvotes: 1
