Reputation: 995
This looks like a simple regex, no backreferences, no "any" characters, I'd even dare to say it's parseable by a Thomson DFA and all. It even works, but chokes on very simple non-matches.
{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*?,\s*?
(?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*?
(?P<bla>[^\n}]+?)\s*?,\s*?
(?P<bla2>[^\n}]+?)\s*?,\s*?
(?P<bla3>[^\n}]+?)\s*?,\s*?
(?P<bla4>[^\n}]+?)\s*?
}
+ re.MULTILINE | re.VERBOSE
I'm currently trying this on Python 2.7.8 (but the linked gist fails on py3.4 too; also linux, x86-64, Ubuntu, PCRE statically linked in [at least /proc//maps doesn't show anything interesting).
This parses well:
{ ngx_string("daemon"),
NGX_MAIN_CONF|NGX_DIRECT_CONF|NGX_CONF_FLAG,
ngx_conf_set_flag_slot,
0,
offsetof(ngx_core_conf_t, daemon),
NULL },
And this is where the fun stops:
{ ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OFF },
{ ngx_string("on"), NGX_HTTP_REQUEST_BODY_FILE_ON },
Also, more data:
by changing the second line to this
(?P<where>(([A-Z0-9_]{1,20})\s*\|?){1,6}?)\s{0,10}?,\s{0,10}?
, it finally completes in reasonable time, but the exponential blowing up is still there, just bearable:
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE
Took 0.033483 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_
Took 0.038528 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_O
Took 0.044108 s
trying { ngx_string("off"), NGX_HTTP_REQUEST_BODY_FILE_OF
Took 0.053547 s
Also, interestingly a JS-based Python regex (emulator?) parser can eat it like it's breakfast for PCRE champs: https://www.debuggex.com/r/S__vSvp8-LGLuCLQ
Oh, and maybe someone should create the pathological-regex tag :|
Upvotes: 1
Views: 100
Reputation: 56809
The problem is (([A-Z0-9_]+)\s*\|?)+?
in your original regex or (([A-Z0-9_]{1,20})\s*\|?){1,6}?
in your test regex. The {1,20}
and {1,6}
only serves to inhibit the exponential backtracking a bit.
Since \s*
and \|?
are optional, the regex can expand to the classic example of exponential backtracking (([A-Z0-9_]+))+?
when the input contains only [A-Z0-9_]
without spaces or bar |
, but does not matches the rest of the regex.
For example, matching (?P<where>(([A-Z0-9_]+)\s*\|?)+?)\s*?,\s*?
against AAAAAAAAAAAAAA
(,
is missing) would cause the engine to check all possibility of splitting the string up, each token to be matched at different iterations of the outer repetition:
AAAAAAAAAAAAAA
AAAAAAAAAAAAA A
AAAAAAAAAAAA AA
AAAAAAAAAAAA A A
AAAAAAAAAAA AAA
...
A AAAAAAAAAAAAA
A AAAAAAAAAAAA A
A AAAAAAAAAAA AA
...
A A A A A A A A A A A A A A
On a closer look, the rest of your regex also have excessive backtracking problem.
Take (?P<bla2>[^\n}]+?)\s*?,\s*?
for example. [^\n}]
can match a space (or a tab, or a number of other space characters), and so can \s
. This may cause excessive backtracking if your non-matching input contains a long string of spaces. There might also be correctness issue, since ,
can be matched by [^\n}]
also.
On a side note, Python re
is an NFA engine without any optimization to mitigate the exponential backtracking problem.
To mitigate the exponential backtracking:
{\s*
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,\s*
(?P<bla>[^\n}]+?)\s*,\s*
(?P<bla2>[^\n}]+?)\s*,\s*
(?P<bla3>[^\n}]+?)\s*,\s*
(?P<bla4>[^\n}]+?)\s*
}
The excessive backtracking and correctness problem with [^\n}]
are still not fixed. The ,
in function call can be wrongly recognized as part of the outer block {}
if the arguments are not on different lines.
The general solution would require recursive regex, since you can call a function as pass its result as argument, e.g. offsetof(ngx_core_conf_t, daemon)
, but the recursive regex feature is not available in re
package. A less general solution would be to match up to some levels of nested parentheses, for example 1 level of nesting:
(?P<bla>(?:\([^)]*\)|[^,()])+),\s*
And the whole regex is:
{\s*?
ngx_string\("(?P<name>[a-z0-9_]+)"\)\s*,\s*
(?P<where>[A-Z0-9_]+(?:\s*\|\s*[A-Z0-9_]+)*)\s*,
(?P<bla>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla2>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla3>(?:\([^)]*\)|[^,()])+),\s*
(?P<bla4>(?:\([^)]*\)|[^,()])+)
}
There are 2 caveats:
<bla*>
capturing groups will contain spaces at the end. The regex will be a bit longer if you want to remove spaces, and prevent possible excessive backtracking. You can try adding \s*
before the ,
back into this demo here.()
are not part of any string literal.Upvotes: 2
Reputation: 67968
(?P<where>(([A-Z0-9_]+)\s*\|?)+?)
^ ^
This is where your regex is exploding.Read http://www.regular-expressions.info/catastrophic.html .
On every failure it goes back one step to check if there is a match.This is creating an explosion of steps and possibilities for the regex engine.
Upvotes: 2