Reputation: 43
I have a python application which needs to process user submitted regular expressions. Because of performance considerations i want to forbid capturing groups and back references.
My idea is to use another regular expression to verify that the user submitted regex does not contain any named or unnamed group captures like this:
def validate_user_regex(pattern):
if re.match('[^\\\]\((?:\?P).*?[^\\\]\)', pattern) is not None:
return False
return True
While i think my idea may work for capturing groups, i am not sure if this would prevent all kinds of back references. So are there any smarter ways to prevent capturing groups and back references in regular expressions?
Upvotes: 2
Views: 132
Reputation: 36026
Regular expression language is not a regular language so it cannot be reliably split into meaningful parts by a regex (see RegEx match open tags except XHTML self-contained tags for the same case for HTML).
Why not use Python's own parser instead to do this?
>>> r="whate(ever)(?:\\1)"
>>> import sre_parse #the module used by `re' internally for regex parsing
>>> sre_parse.parse(r)
[('literal', 119), ('literal', 104), ('literal', 97), ('literal', 116),
('literal', 101), ('subpattern', (1, [('literal', 101), ('literal', 118), ('lit
eral', 101), ('literal', 114)])), ('subpattern', (None, [('groupref', 1)]))]
As you can see, this is a parse tree, and you're interested in subpattern
nodes with non-None
in the first element and groupref
's.
Upvotes: 2
Reputation: 43196
Your regex is very naive. (In fact, I had a hard time finding a group construct that is matched by your regex.)
To avoid false positives, like for example [(bar)]
, it's necessary to parse/match the entire pattern from left to right. I've come up with this regex:
^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$
Explanation:
^ # start of string anchor
(?: # this group matches a single valid expression:
# character classes:
\[ # the opening [
(?:
\\. # any escaped character
| # or
[^\]] # anything that's not a closing ]
)* # any number of times
\] # the closing ]
|
# non-capturing groups:
\(\?: # (?: literally
|
# normal characters (anything that's not a backslash or (
[^(\\]
|
# meta sequences like \s
\\\D
|
# inline modifiers like (?i)
\(\?[gmixsu]
)* # any number of valid expressions.
$ # end of string anchor
P.S.: This regex does not guarantee that the pattern is valid. (Compiling the pattern can still fail.)
Upvotes: 0