Why is this regex experiencing catastrophic backtracking?

Question

I've read the articles and other questions on catastrophic backtracking in regular expressions and how it can be caused by nested + and * quantifiers. However, my regexes are still encountering catastrophic backtracking without nested quantifiers. Can someone help me understand why?

I wrote these regexes to search for a specific type of rhyme in lines of welsh poetry. The rhyme consists of all the consonants in the beginning of the line being repeated at the end, and there must be a space between the beginning and end consonants. I've already removed all the vowels, but there are two exceptions that make these regexes ugly. First, there are allowed to be consonants in the middle that don't repeat, and if there are any, it's a different type of rhyme. Second, the letters m, n, r, h, and v are allowed to interrupt the rhyme (appear in the beginning but not in the end or vice versa), but they can't be ignored because sometimes the rhyme consists only of those letters.

My script automatically builds a regex for each line and tests it. It works the rest of the time, but this one line is giving catastrophic backtracking. The line's text without vowels is:

nn  Frvvn  Frv v

The regex automatically finds that nn Frvvn rhymes with Frv v, so then it tries it again with the last letter (the n in Frvvn) required in the back. If it's not required, then the rhyme can be shortened. Here's the regex:

^(?P         # starting letters
[mnrhv]*?\s*n{0,2}      # any number of optional letters or any number
                        # of spaces can come between rhyming letters
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P          # middle letters
[^\s]*?(?P\s)  # the caesura (end of the rhyme) is the
                        # first space after the rhyme     
.*)                     # End letters come as late as possible
(?P          # End group
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])?
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)$

Even though it doesn't have any nested quantifiers, it still takes forever to run. Regexes for other lines that were generated in the same way run quickly. Why is this?

Why is this regex experiencing catastrophic backtracking?

Answers (1)

Related Questions