Reputation: 283083
Regex:
(?|`(?>[^`\\]|\\.|``)*`|'(?>[^'\\]|\\.|'')*'|"(?>[^"\\]|\\.|"")*"|(\?{1,2})|(:{1,2})([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*))
Example input:
INSERT INTO xyz WHERE
a=?
and b="what?"
and ??="cheese"
and `col?`='OK'
and ::col='another'
and last!=:least
https://regex101.com/r/HnTVXx/6
It should match ?
, ??
, :xyz
and ::xyz
but not if they are inside of a backquoted-string, double-quoted string, or single-quoted string.
When I try running this in PHP with a very large input I get PREG_RECURSION_LIMIT_ERROR
from preg_last_error()
.
How can I simplify this regex pattern so that it doesn't do so much recursion?
Here's some test code that shows the error in PHP using Niet's optimized regex: https://3v4l.org/GdtmP Error code 6 is PREG_JIT_STACKLIMIT_ERROR
. The other one I've seen is 3=PREG_RECURSION_LIMIT_ERROR
Upvotes: 4
Views: 785
Reputation: 89584
Same idea to skip quoted parts (the (*SKIP)(*F)
combo), but also 2 techniques to reduce the regex engine work:
These 2 techniques have something in common: limiting the cost of alternations.
The first character discrimination is useful when your pattern starts with an alternation. The problem with an alternation at the beginning is that each branch should be tested so that a position where the pattern fails is identified. Since most of the time, there are many failing positions in a string, discarding them quickly constitutes a significant improvement.
For instance, something like: "...|'...|`...|:...
can also be written like this:
(?=["'`:])(?:"...|'...|`...|:...)
or
["'`:](?:(?<=")...|(?<=')...|(?<=`)...|(?<=:)...)
This way, each position that doesn't start with one of these characters ["'`:]
is immediately rejected with the first token without to test each branch.
The unrolled pattern consists to rewrite something like: " (?:[^"\\]|\\.)* "
into:
" [^"\\]* (?: \\. [^"\\]* )* "
Note that this design eliminates the alternation and reduces the number of steps drastically:
basic
unrolled
Using these 2 techniques, your pattern can be written like this:
~
[`'"?:]
(?:
(?<=`) [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` (*SKIP) (*F)
|
(?<=') [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' (*SKIP) (*F)
|
(?<=") [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " (*SKIP) (*F)
|
(?<=\?) \??
|
(?<=:) :? ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~x
Other way: instead of using an alternation (improved or not) at the beginning, you can build a pattern that matches all the string with contiguous results. The general design is:
\G (all I don't want) (*SKIP) \K (what I am looking for)
\G
is an anchor that matches either the position after the previous result or the start of the string. Starting a pattern with it ensures that all the matches are contiguous. In this situation (at the beginning of the pattern and in factor to the whole pattern), you can also replace it with the A modifier.
That gives:
~
[^`'"?:]*
(?:
` [^`\\]*+ (?s:\\.[^`\\]*|``[^`\\]*)*+ ` [^`'"?:]*
|
' [^'\\]*+ (?s:\\.[^'\\]*|''[^'\\]*)*+ ' [^`'"?:]*
|
" [^"\\]*+ (?s:\\.[^"\\]*|""[^"\\]*)*+ " [^`'"?:]*
)*
\K # only the part of the match after this position is returned
(*SKIP) # if the next subpattern fails, the contiguity is broken at this position
(?:
\?{1,2}
|
:{1,2} ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)
)
~Ax
Upvotes: 1
Reputation: 324750
The general idea of "match this thing, but not in this condition" can be achieved with this pattern:
(don't match this(*SKIP)(*FAIL)|match this)
In your case, you'd want something like...
(
(['"`]) # capture this quote character
(?:\\.|(?!\1).)*+ # any escaped character, or
# any character that isn't the captured one
\1 # the captured quote again
(*SKIP)(*FAIL) # ignore this
|
\?\?? # one or two question marks
|
::?\w+ # word characters marked with one or two colons
)x
https://regex101.com/r/HnTVXx/7
Upvotes: 3