Reputation: 2753
Below I have a rather complicated regex that seems to never terminate. It's not a matter of simply taking a long time -- I've waited several minutes for a response with no luck.
Below is a bit of code that reproduces the problem:
import re
link = re.compile(u'(?i)((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))', re.IGNORECASE)
text = "Check out this link http://ru.wikipedia.org/wiki/%D0%9B%D0%BE%D0%B2%D0%B5%D1%86_%D1%81%D0%BD%D0%BE%D0%B2_(%D0%B0%D0%BC%D1%83%D0%BB%D0%B5%D1%82"
matches = re.findall(link, text)
Here the output from passing re.DEBUG as a flag to the pattern:
subpattern 1
subpattern None
branch
literal 104
literal 116
literal 116
literal 112
max_repeat 0 1
literal 115
literal 58
literal 47
literal 47
or
literal 119
literal 119
literal 119
max_repeat 0 3
in
category category_digit
literal 46
or
max_repeat 1 4294967295
in
range (97, 122)
range (48, 57)
literal 46
literal 45
literal 46
max_repeat 2 4
in
range (97, 122)
literal 47
max_repeat 1 4294967295
subpattern None
branch
max_repeat 1 4294967295
in
negate None
category category_space
literal 40
literal 41
literal 60
literal 62
or
literal 40
max_repeat 0 4294967295
subpattern None
branch
max_repeat 1 4294967295
in
negate None
category category_space
literal 40
literal 41
literal 60
literal 62
or
subpattern None
literal 40
max_repeat 1 4294967295
in
negate None
category category_space
literal 40
literal 41
literal 60
literal 62
literal 41
literal 41
subpattern None
branch
literal 40
max_repeat 0 4294967295
subpattern None
branch
max_repeat 1 4294967295
in
negate None
category category_space
literal 40
literal 41
literal 60
literal 62
or
subpattern None
literal 40
max_repeat 1 4294967295
in
negate None
category category_space
literal 40
literal 41
literal 60
literal 62
literal 41
literal 41
or
in
negate None
category category_space
literal 96
literal 33
literal 40
literal 41
literal 91
literal 93
literal 123
literal 125
literal 59
literal 58
literal 39
literal 34
literal 46
literal 44
literal 60
literal 62
literal 63
literal 171
literal 187
literal 8220
literal 8221
literal 8216
literal 8217
Upvotes: 1
Views: 514
Reputation:
This probably won't help much. I don't know (but don't think so) if Python supports possesive quantifiers.
But if it does, the below stops it from catastrophic backtracking.
I tried to make an equivalent that stopped the backtracking without possesive's.
It did, but the meaning changed, didn't match the same.
Your regex is a simulated recursive regex in every detail. You can see the repeated pattern below.
That is the primary problem here, in that it has to be possesive in nature, because in reality, its matching optional balanced text and there is no boundry to stop it from backtracking.
I don't think there is a way to get an equavalent (non- backtracking) regex in this case.
So, if it can't do possesive quantifiers, the regex is doomed in any context its used.
Good luck!
A note- gigantic regexes are childs play if you use RegexFormat4. Its a cut'n paste and button click. Modify it, compress it, test it. Repeated several times using different constructs.
# (?i)((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*+\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*+\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))
# matches: 'http://ru.wikipedia.org/wiki/%D0%9B%D0%BE%D0%B2%D0%B5%D1%86_%D1%81%D0%BD%D0%BE%D0%B2_'
(?i)
(
(?:
https?://
| www \d{0,3} [.]
| [a-z0-9.\-]+ [.] [a-z]{2,4} /
)
(?:
[^\s()<>]+
| \(
(?:
[^\s()<>]+
| (?: \( [^\s()<>]+ \) )
)*+ # <- Possesive
\)
)+
(?:
\(
(?:
[^\s()<>]+
| (?: \( [^\s()<>]+ \) )
)*+ # <- Possesive
\)
| [^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]
)
)
Upvotes: 1
Reputation: 72905
If you strip %D0%B5%D1%82
off the end of your string, it works (albeit in around 10s). As you increase the complexity of the string and the match, you're exponentially increasing the amount of processing required to check the conditions of your expressions, and you're likely spiking your CPU all the way up to 100%.
This is called catastrphic backtracking -- it looks to me like you've got somewhere between 99 and 100 problems ;)
The CPU is being blocked to the point where it can't continue to process the expression.
Solutions:
I honestly don't know if this is going to cut it for you, but since you've got a LOT of overlapping conditionals, this appears to cut them down. Best check a bunch of samples to verify:
(?i)(?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)((?:\(?[^\s()<>]+\)?)*[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019])
Example: http://regex101.com/r/lV0oL4
I'm sure you can simplify it even more, this was just a quick stab at doing so. Processing time on your string: Less than 100ms.
Upvotes: 3