Madison May
Madison May

Reputation: 2753

Python re.findall() doesn't terminate

Below I have a rather complicated regex that seems to never terminate. It's not a matter of simply taking a long time -- I've waited several minutes for a response with no luck.

Below is a bit of code that reproduces the problem:

import re    
link = re.compile(u'(?i)((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))', re.IGNORECASE)
text = "Check out this link http://ru.wikipedia.org/wiki/%D0%9B%D0%BE%D0%B2%D0%B5%D1%86_%D1%81%D0%BD%D0%BE%D0%B2_(%D0%B0%D0%BC%D1%83%D0%BB%D0%B5%D1%82"
matches = re.findall(link, text)

Here the output from passing re.DEBUG as a flag to the pattern:

subpattern 1
  subpattern None
    branch
      literal 104
      literal 116
      literal 116
      literal 112
      max_repeat 0 1
        literal 115
      literal 58
      literal 47
      literal 47
    or
      literal 119
      literal 119
      literal 119
      max_repeat 0 3
        in
          category category_digit
      literal 46
    or
      max_repeat 1 4294967295
        in
          range (97, 122)
          range (48, 57)
          literal 46
          literal 45
      literal 46
      max_repeat 2 4
        in
          range (97, 122)
      literal 47
  max_repeat 1 4294967295
    subpattern None
      branch
        max_repeat 1 4294967295
          in
            negate None
            category category_space
            literal 40
            literal 41
            literal 60
            literal 62
      or
        literal 40
        max_repeat 0 4294967295
          subpattern None
            branch
              max_repeat 1 4294967295
                in
                  negate None
                  category category_space
                  literal 40
                  literal 41
                  literal 60
                  literal 62
            or
              subpattern None
                literal 40
                max_repeat 1 4294967295
                  in
                    negate None
                    category category_space
                    literal 40
                    literal 41
                    literal 60
                    literal 62
                literal 41
        literal 41
  subpattern None
    branch
      literal 40
      max_repeat 0 4294967295
        subpattern None
          branch
            max_repeat 1 4294967295
              in
                negate None
                category category_space
                literal 40
                literal 41
                literal 60
                literal 62
          or
            subpattern None
              literal 40
              max_repeat 1 4294967295
                in
                  negate None
                  category category_space
                  literal 40
                  literal 41
                  literal 60
                  literal 62
              literal 41
      literal 41
    or
      in
        negate None
        category category_space
        literal 96
        literal 33
        literal 40
        literal 41
        literal 91
        literal 93
        literal 123
        literal 125
        literal 59
        literal 58
        literal 39
        literal 34
        literal 46
        literal 44
        literal 60
        literal 62
        literal 63
        literal 171
        literal 187
        literal 8220
        literal 8221
        literal 8216
        literal 8217

Upvotes: 1

Views: 514

Answers (2)

user557597
user557597

Reputation:

This probably won't help much. I don't know (but don't think so) if Python supports possesive quantifiers.
But if it does, the below stops it from catastrophic backtracking.
I tried to make an equivalent that stopped the backtracking without possesive's.
It did, but the meaning changed, didn't match the same.

Your regex is a simulated recursive regex in every detail. You can see the repeated pattern below.
That is the primary problem here, in that it has to be possesive in nature, because in reality, its matching optional balanced text and there is no boundry to stop it from backtracking.

I don't think there is a way to get an equavalent (non- backtracking) regex in this case.
So, if it can't do possesive quantifiers, the regex is doomed in any context its used.

Good luck!
A note- gigantic regexes are childs play if you use RegexFormat4. Its a cut'n paste and button click. Modify it, compress it, test it. Repeated several times using different constructs.

 #  (?i)((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*+\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*+\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))

 #  matches:  'http://ru.wikipedia.org/wiki/%D0%9B%D0%BE%D0%B2%D0%B5%D1%86_%D1%81%D0%BD%D0%BE%D0%B2_'

 (?i)
 (
      (?:
           https?://
        |  www \d{0,3} [.] 
        |  [a-z0-9.\-]+ [.] [a-z]{2,4} /
      )
      (?:
           [^\s()<>]+ 
        |  \(
           (?:
                [^\s()<>]+ 
             |  (?: \( [^\s()<>]+ \) )
           )*+                           #  <- Possesive
           \)
      )+
      (?:
           \(
           (?:
                [^\s()<>]+ 
             |  (?: \( [^\s()<>]+ \) )
           )*+                           #  <- Possesive
           \)
        |  [^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019] 
      )
 )

Upvotes: 1

brandonscript
brandonscript

Reputation: 72905

If you strip %D0%B5%D1%82 off the end of your string, it works (albeit in around 10s). As you increase the complexity of the string and the match, you're exponentially increasing the amount of processing required to check the conditions of your expressions, and you're likely spiking your CPU all the way up to 100%.

This is called catastrphic backtracking -- it looks to me like you've got somewhere between 99 and 100 problems ;)

enter image description here

The CPU is being blocked to the point where it can't continue to process the expression.

Solutions:

  • Simplify your expression (the more reliable and sensible option), or
  • Shorten/truncate the sample before trying to verify it (kind of defeats the purpose of using the expression in the first place)

I honestly don't know if this is going to cut it for you, but since you've got a LOT of overlapping conditionals, this appears to cut them down. Best check a bunch of samples to verify:

(?i)(?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)((?:\(?[^\s()<>]+\)?)*[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019])

Example: http://regex101.com/r/lV0oL4

I'm sure you can simplify it even more, this was just a quick stab at doing so. Processing time on your string: Less than 100ms.

Upvotes: 3

Related Questions