samwyse

Reputation: 2996

How to find if a regex contains non-escaped metacharacters?

I have a list of regexes from which I want to extract those that are equivalent to a string comparison.

For example, these regexes are equivalent to a simple string comparison:

[r"example",   # No metacharacters
 r"foo\.bar"]  # . is not a metacharacter because it is escaped

while these regexes are not:

[r"e.ample",   # . is a metacharacter
 r"foo\\.bar"] # . is a metacharacter: the \ before it is itself escaped

According to https://docs.python.org/2/howto/regex.html, the list of valid metacharacters is . ^ $ * + ? { } [ ] \ | ( ).

I'm about to write a regex to detect this, but it looks a bit complicated. Is there a shortcut, for example by examining the compiled re object?

Upvotes: 4

Views: 248

Answers (2)

anubhava

Reputation: 785196

Here is a regex that you can use to detect non-escaped metacharacters in Python:

>>> import re
>>> rex = re.compile(r'^([^\\]*)(\\.[^.^$*+?{}\[\]|()\\]*)*[.^$*+?{}\[\]|()]', re.MULTILINE)
>>> arr = [r"example", r"foo\.bar", r"e.ample", r"foo\\.bar", r"foo\\bar\.baz"]
>>> for s in arr:
...     print(s, rex.search(s) is not None)
...

The regex above consumes every \ together with the character that follows it, so escaped characters are skipped. It then searches for a metacharacter, which is one of:

. ^ $ * + ? { } [ ] | ( )

that is not preceded by an escaping \.

Output:

example False
foo\.bar False
e.ample True
foo\\.bar True
foo\\bar\.baz False
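
To make the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE and non-capturing groups. The names UNESCAPED_META and has_unescaped_meta are made up for this illustration; the logic is meant to be the same as the one-liner above.

```python
import re

# Same idea as the one-line regex above, spelled out piece by piece.
UNESCAPED_META = re.compile(r"""
    ^
    [^\\]*                        # any run of characters before the first backslash
    (?:
        \\ .                      # a backslash plus the character it escapes
        [^.^$*+?{}\[\]|()\\]*     # then characters that are neither meta nor backslash
    )*
    [.^$*+?{}\[\]|()]             # finally, a metacharacter that nothing escaped
""", re.VERBOSE)


def has_unescaped_meta(pattern):
    """Return True if pattern contains a metacharacter not preceded by an escaping backslash."""
    return UNESCAPED_META.search(pattern) is not None


if __name__ == "__main__":
    for s in [r"example", r"foo\.bar", r"e.ample", r"foo\\.bar", r"foo\\bar\.baz"]:
        print(s, has_unescaped_meta(s))
```

One caveat with this approach: any backslash sequence counts as "escaped", so a pattern like r"\d" is reported as having no metacharacters even though \d is a character class rather than a literal.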


Upvotes: 2

Tim Pietzcker

Reputation: 336168

Inspired by Keith Hall's comment, here's a solution based on the debug output of Python's regex compiler (the re.DEBUG flag, whose output format is undocumented):

import contextlib, io, re

def contains_meta(regex):
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # capture stdout; it is restored even if compile() raises
        re.compile(regex, re.DEBUG)        # compile the regex for the debug tree side effect
    # Newer Python 3 releases also print the compiled code after a blank line,
    # so keep only the parse tree that comes before it.
    tree = buf.getvalue().split("\n\n", 1)[0]
    return not all(line.startswith("LITERAL ") for line in tree.strip().split("\n"))

Output:

In [9]: contains_meta(r"example")
Out[9]: False

In [10]: contains_meta(r"ex.mple")
Out[10]: True

In [11]: contains_meta(r"ex\.mple")
Out[11]: False

In [12]: contains_meta(r"ex\\.mple")
Out[12]: True

In [13]: contains_meta(r"ex[.]mple")  # single-character charclass --> literal
Out[13]: False

In [14]: contains_meta(r"ex[a-z]mple")
Out[14]: True

In [15]: contains_meta(r"ex[.,]mple")
Out[15]: True
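
A variant that skips the stdout capture is to ask the regex parser for the parse tree directly. This is only a sketch: it relies on private modules (re._parser on Python 3.11+, sre_parse on older versions), which are not a stable API, and the name contains_meta_via_parser is made up for this example.

```python
import re

try:
    from re import _parser as regex_parser  # Python 3.11+ (private module)
except ImportError:
    import sre_parse as regex_parser         # older Python versions


def contains_meta_via_parser(regex):
    """Return True unless every node of the parse tree is a plain LITERAL."""
    tree = regex_parser.parse(regex)
    # Each top-level node is an (opcode, argument) pair.
    return not all(op == regex_parser.LITERAL for op, _ in tree)


if __name__ == "__main__":
    for pattern in [r"example", r"ex.mple", r"ex\.mple", r"ex\\.mple", r"ex[a-z]mple", r"\d"]:
        print(pattern, contains_meta_via_parser(pattern))
```

Because it inspects the parse tree rather than the raw text, this also flags escapes such as \d that are not literals. Note that an empty pattern yields an empty tree, so this sketch returns False for it.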

Upvotes: 6
