Why is my Python regular expression checking for more than one group taking so long?

Question

This question originates in Django URL resolver, but the problem seems to be a general one.

I want to match URLs built like this:

1,2,3,4,5,6/10,11,12/

The regular expression I'm using is:

^(?P([0123456789]+,?)+)/(?P([0123456789]+,?)+)/$

When I try to match it against a "valid" URL (i.e. one that matches), I get an instant match:

In [11]: print datetime.datetime.now(); re.compile(r"^(?P([0123456789]+,?)+)/(?P([0123456789]+,?)+)/$").search("114,414,415,416,417,418,419,420,113,410,411,412,413/1/"); print datetime.datetime.now()
2011-10-18 14:27:42.087883
Out[11]: <_sre.SRE_Match object at 0x2ab0960>
2011-10-18 14:27:42.088145

However, when I try to match an "invalid" URL (non-matching), the whole regular expression takes a magnitude of time to return nothing:

In [12]: print datetime.datetime.now(); re.compile(r"^(?P([0123456789]+,?)+)/(?P([0123456789]+,?)+)/").search("114,414,415,416,417,418,419,420,113,410,411,412,413/"); print datetime.datetime.now()
2011-10-18 14:29:21.011342
2011-10-18 14:30:00.573270

I assume there is something in the regexp engine that slows down extremely when several groups need to be matched. Is there any workaround for this? Maybe my regexp needs to be fixed?

Dietrich Epp · Accepted Answer

This is a known deficiency in many regular expression engines, including Python's and Perl's. What is happening is the engine is backtracking and getting an exponential explosion of possible matches to try. Better regular expression engines do not use backtracking for such a simple regular expression.

You can fix it by getting rid of the optional comma. This is what is allowing the engine to look at a string like 123 and decide whether to parse it as (123) or (12)(3) or (1)(23) or (1)(2)(3). That's a lot of matches to try just for three digits, so you can see how it would explode rather quickly for a couple dozen digits.

^(?P[0-9]+(,[0-9]+)*)/(?P[0-9]+(,[0-9]+)*)/$

This will make the regular expression engine always group 123,456 as (123),(456) and never as (12)(3),(4)(56) or something else. Because it will only match in that one way, the backtracking engine won't hit a combinatorial explosion of possible parses. Again, better regular expression engines do not suffer from this flaw.

Update: If I were writing it, I would do it this way:

^(?P[0-9,]+)/(?P[0-9,]+)$

This would match a few bogus URLs (like ,/,), but you can always return a 404 after you've parsed and routed it.

try:
    apples = [int(x) for x in apples.split(',')]
except ValueError:
    # return 404 error

Why is my Python regular expression checking for more than one group taking so long?

Answers (2)

Related Questions