Why my Python regular expression pattern run so slowly?

Question

Please see my regular expression pattern code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re

print 'Start'
str1 = 'abcdefgasdsdfswossdfasdaef'
m = re.match(r"([A-Za-z\-\s\:\.]+)+(\d+)\w+", str1) # Want to match something like 'Moto 360x'
print m # None is expected.
print 'Done'

It takes 49 seconds to finish, any problem with the pattern?

ivan_pozdeev · Accepted Answer

See Runaway Regular Expressions: Catastrophic Backtracking.

In brief, if there are extremely many combinations a substring can be split into the parts of the regex, the regex matcher may end up trying them all.

Constructs like (x+)+ and x+x+ practically guarantee this behaviour.

To detect and fix the problematic constructs, the following concept can be used:

At conceptual level, the presence of a problematic construct means that your regex is ambiguous - i.e. if you disregard greedy/lazy behaviour, there's no single "correct" split of some text into the parts of the regex (or, equivalently, a subexpression thereof). So, to avoid/fix the problems, you need to see and eliminate all ambiguities.
- One way to do this is to
  - always split the text into its meaningful parts (=parts that have separate meanings for the task at hand), and
  - define the parts in such a way that they cannot be confused (=using the same characteristics that you yourself would use to tell which is which if you were parsing it by hand)

Why my Python regular expression pattern run so slowly?

Answers (2)

Related Questions