Daniel Gonzalez
Daniel Gonzalez

Reputation: 43

Find capturing groups in user submitted regex

I have a python application which needs to process user submitted regular expressions. Because of performance considerations i want to forbid capturing groups and back references.

My idea is to use another regular expression to verify that the user submitted regex does not contain any named or unnamed group captures like this:

def validate_user_regex(pattern):
    if re.match('[^\\\]\((?:\?P).*?[^\\\]\)', pattern) is not None:
        return False
    return True

While i think my idea may work for capturing groups, i am not sure if this would prevent all kinds of back references. So are there any smarter ways to prevent capturing groups and back references in regular expressions?

Upvotes: 2

Views: 132

Answers (2)

ivan_pozdeev
ivan_pozdeev

Reputation: 36026

Regular expression language is not a regular language so it cannot be reliably split into meaningful parts by a regex (see RegEx match open tags except XHTML self-contained tags for the same case for HTML).

Why not use Python's own parser instead to do this?

>>> r="whate(ever)(?:\\1)"
>>> import sre_parse        #the module used by `re' internally for regex parsing
>>> sre_parse.parse(r)
[('literal', 119), ('literal', 104), ('literal', 97), ('literal', 116),
 ('literal', 101), ('subpattern', (1, [('literal', 101), ('literal', 118), ('lit
eral', 101), ('literal', 114)])), ('subpattern', (None, [('groupref', 1)]))]

As you can see, this is a parse tree, and you're interested in subpattern nodes with non-None in the first element and groupref's.

Upvotes: 2

Aran-Fey
Aran-Fey

Reputation: 43196

Your regex is very naive. (In fact, I had a hard time finding a group construct that is matched by your regex.)

To avoid false positives, like for example [(bar)], it's necessary to parse/match the entire pattern from left to right. I've come up with this regex:

^(?:\[(?:\\.|[^\]])*\]|\(\?:|[^(\\]|\\\D|\(\?[gmixsu])*$

regex101 demo.


Explanation:

^   # start of string anchor
(?:  # this group matches a single valid expression:
# character classes:
    \[  # the opening [
    (?: 
        \\.  # any escaped character
    | # or 
        [^\]] # anything that's not a closing ]
    )* # any number of times
    \]  # the closing ]
|
# non-capturing groups:
    \(\?:  # (?: literally
|
# normal characters (anything that's not a backslash or (
    [^(\\]
|
# meta sequences like \s
    \\\D
|
# inline modifiers like (?i)
    \(\?[gmixsu]
)* # any number of valid expressions.
$  # end of string anchor

P.S.: This regex does not guarantee that the pattern is valid. (Compiling the pattern can still fail.)

Upvotes: 0

Related Questions