user5003555
user5003555

Reputation:

Why is this regex experiencing catastrophic backtracking?

I've read the articles and other questions on catastrophic backtracking in regular expressions and how it can be caused by nested + and * quantifiers. However, my regexes are still encountering catastrophic backtracking without nested quantifiers. Can someone help me understand why?

I wrote these regexes to search for a specific type of rhyme in lines of welsh poetry. The rhyme consists of all the consonants in the beginning of the line being repeated at the end, and there must be a space between the beginning and end consonants. I've already removed all the vowels, but there are two exceptions that make these regexes ugly. First, there are allowed to be consonants in the middle that don't repeat, and if there are any, it's a different type of rhyme. Second, the letters m, n, r, h, and v are allowed to interrupt the rhyme (appear in the beginning but not in the end or vice versa), but they can't be ignored because sometimes the rhyme consists only of those letters.

My script automatically builds a regex for each line and tests it. It works the rest of the time, but this one line is giving catastrophic backtracking. The line's text without vowels is:

nn  Frvvn  Frv v

The regex automatically finds that nn Frvvn rhymes with Frv v, so then it tries it again with the last letter (the n in Frvvn) required in the back. If it's not required, then the rhyme can be shortened. Here's the regex:

^(?P<s_letters>         # starting letters
[mnrhv]*?\s*n{0,2}      # any number of optional letters or any number
                        # of spaces can come between rhyming letters
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the
                        # first space after the rhyme     
.*)                     # End letters come as late as possible
(?P<e_letters>          # End group
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])?
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)$

Even though it doesn't have any nested quantifiers, it still takes forever to run. Regexes for other lines that were generated in the same way run quickly. Why is this?

Upvotes: 1

Views: 183

Answers (1)

user2357112
user2357112

Reputation: 281683

I'm not seeing any nested quantifiers, but I am seeing a lot of ambiguities that would cause high-exponent polynomial runtime. For example, consider this part of the regex:

[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the

Suppose the regex engine is at this point, and the text it's seeing is just a huge block of ns. Those ns can be divided between the following parts of the regex:

[mnrhv]*?\s*[rR]?(?:\s*[rR])?
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^
[mnrhv]*?\s*n{1,2}
^^^^^^^^^   ^^^^^^
[mnrhv\s]*?)
^^^^^^^^^^^
(?P<m_letters>
[^\s]*?(?P<caesura>\s)
^^^^^^^

If the number of ns is N, then there are O(N**6) ways to divide the ns, since there are 6 *? blocks that match n here, and everything in between is optional or also matches n.

Are those \s parts mandatory? If so, you might be able to improve runtime by putting a + instead of a * on them.

Upvotes: 3

Related Questions