PAStheLoD
PAStheLoD

Reputation: 995

Why Python chokes on this regex?

This looks like a simple regex, no backreferences, no "any" characters, I'd even dare to say it's parseable by a Thomson DFA and all. It even works, but chokes on very simple non-matches.

{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*?,\s*?
(?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*?
(?P<bla>[^\n}]+?)\s*?,\s*?
(?P<bla2>[^\n}]+?)\s*?,\s*?
(?P<bla3>[^\n}]+?)\s*?,\s*?
(?P<bla4>[^\n}]+?)\s*?
}

+ re.MULTILINE | re.VERBOSE 

runnable gist here

I'm currently trying this on Python 2.7.8 (but the linked gist fails on py3.4 too; also linux, x86-64, Ubuntu, PCRE statically linked in [at least /proc//maps doesn't show anything interesting).

This parses well:

{ ngx_string("daemon"),
  NGX_MAIN_CONF|NGX_DIRECT_CONF|NGX_CONF_FLAG,
  ngx_conf_set_flag_slot,
  0,
  offsetof(ngx_core_conf_t, daemon),
  NULL },

And this is where the fun stops:

 { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OFF },
 { ngx_string("on"), NGX_HTTP_REQUEST_BODY_FILE_ON },

Also, more data:

by changing the second line to this

(?P<where>(([A-Z0-9_]{1,20})\s*\|?){1,6}?)\s{0,10}?,\s{0,10}?

, it finally completes in reasonable time, but the exponential blowing up is still there, just bearable:

trying      { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE
Took 0.033483 s
trying      { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_
Took 0.038528 s
trying      { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_O
Took 0.044108 s
trying      { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OF
Took 0.053547 s

Also, interestingly a JS-based Python regex (emulator?) parser can eat it like it's breakfast for PCRE champs: https://www.debuggex.com/r/S__vSvp8-LGLuCLQ

Oh, and maybe someone should create the pathological-regex tag :|

Upvotes: 1

Views: 100

Answers (2)

nhahtdh
nhahtdh

Reputation: 56809

The problem is (([A-Z0-9_]+)\s*\|?)+? in your original regex or (([A-Z0-9_]{1,20})\s*\|?){1,6}? in your test regex. The {1,20} and {1,6} only serves to inhibit the exponential backtracking a bit.

Since \s* and \|? are optional, the regex can expand to the classic example of exponential backtracking (([A-Z0-9_]+))+? when the input contains only [A-Z0-9_] without spaces or bar |, but does not matches the rest of the regex.

For example, matching (?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*? against AAAAAAAAAAAAAA (, is missing) would cause the engine to check all possibility of splitting the string up, each token to be matched at different iterations of the outer repetition:

AAAAAAAAAAAAAA
AAAAAAAAAAAAA A
AAAAAAAAAAAA AA
AAAAAAAAAAAA A A
AAAAAAAAAAA AAA
...
A AAAAAAAAAAAAA
A AAAAAAAAAAAA A
A AAAAAAAAAAA AA
...
A A A A A A A A A A A A A A

On a closer look, the rest of your regex also have excessive backtracking problem.

Take (?P<bla2>[^\n}]+?)\s*?,\s*? for example. [^\n}] can match a space (or a tab, or a number of other space characters), and so can \s. This may cause excessive backtracking if your non-matching input contains a long string of spaces. There might also be correctness issue, since , can be matched by [^\n}] also.

On a side note, Python re is an NFA engine without any optimization to mitigate the exponential backtracking problem.

To mitigate the exponential backtracking:

{\s*
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,\s*
(?P<bla>[^\n}]+?)\s*,\s*
(?P<bla2>[^\n}]+?)\s*,\s*
(?P<bla3>[^\n}]+?)\s*,\s*
(?P<bla4>[^\n}]+?)\s*
}

The excessive backtracking and correctness problem with [^\n}] are still not fixed. The , in function call can be wrongly recognized as part of the outer block {} if the arguments are not on different lines.

The general solution would require recursive regex, since you can call a function as pass its result as argument, e.g. offsetof(ngx_core_conf_t, daemon), but the recursive regex feature is not available in re package. A less general solution would be to match up to some levels of nested parentheses, for example 1 level of nesting:

(?P<bla>(?:\([^)]*\)|[^,()])+),\s*

And the whole regex is:

{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,
(?P<bla>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla2>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla3>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla4>(?:\([^)]*\)|[^,()])+)
}

DEMO

There are 2 caveats:

  • The <bla*> capturing groups will contain spaces at the end. The regex will be a bit longer if you want to remove spaces, and prevent possible excessive backtracking. You can try adding \s* before the , back into this demo here.
  • It assumes that () are not part of any string literal.

Upvotes: 2

vks
vks

Reputation: 67968

(?P<where>(([A-Z0-9_]+)\s*\|?)+?)
                     ^        ^

This is where your regex is exploding.Read http://www.regular-expressions.info/catastrophic.html .

On every failure it goes back one step to check if there is a match.This is creating an explosion of steps and possibilities for the regex engine.

Upvotes: 2

Related Questions