Identifying bad escape character with regex

Question

Say that we have an escape character \ which is only allowed to immediately precede another \. In other words, the escape character \ is only allowed to escape itself: \\. Escaping any other character is considered a bad escape.

\foo      bad escape at position 0
\\foo     ok
\\\foo    bad escape at position 2
\\\\foo   ok
\\\\\foo  bad escape at position 4

I need to identify these bad escape characters, their position, and what they are trying to escape. We can assume that the input text does not contain newlines. Of course, I could iterate over groups of correct escapes until I find a bad one.

line = '\\\\\\'  # three literal backslashes: one escaped pair plus a dangling one
i = 0

while i < len(line):
    curr_char = line[i]
    next_char = line[i+1] if i < len(line) - 1 else 'EOL'
    if curr_char == '\\':
        if next_char == '\\':
            # valid escape: skip both backslashes
            i += 2
            continue
        else:
            print(f'bad escape at pos {i}: {next_char}')
            break
    else:
        i += 1

But I need a faster solution than this, which is why I would like to match the bad escape with a regular expression. My first, somewhat naive, approach was to match any backslash immediately followed by anything but a backslash: \\([^\\]|$).

import re
p = re.compile(r'\\([^\\]|$)')

p.search('\\')      # [ok] matches the only backslash
p.search('\\f')     # [ok] matches the only backslash
p.search('\\\\')    # [err] matches the correctly escaped backslash
p.search('\\\\\\')  # [ok] matches the last backslash, which indeed is a bad escape

Ok, so that doesn't work. The next logical step seems to be adding a negative look-behind expression (?<!\\) to ignore escaped backslashes.



import re
p = re.compile(r'(?<!\\)\\([^\\]|$)')
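A quick check suggests why a single look-behind is not enough. This is only a sketch, and it assumes the pattern is the naive regex with (?<!\\) prepended; first_match_pos is a hypothetical helper, not part of the original post. The look-behind sees that the previous character is a backslash, but it cannot tell whether that backslash was itself already consumed as part of an escaped pair.

```python
import re

# Assumed pattern: the naive regex with a negative look-behind prepended.
p = re.compile(r'(?<!\\)\\([^\\]|$)')

def first_match_pos(s):
    """Return the start of the first match, or None if nothing matches."""
    m = p.search(s)
    return m.start() if m else None

one = first_match_pos('\\')        # a single backslash: a bad escape
two = first_match_pos('\\\\')      # an escaped pair: valid
three = first_match_pos('\\\\\\')  # a pair plus a dangling backslash: bad at index 2

print(one, two, three)
```

With three backslashes the dangling one at index 2 is preceded by a backslash, so the look-behind rejects it and the bad escape goes unreported.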


Another thing I could do is use substitutions and replace the bad escape with a placeholder, but that seems rather hacky and not very efficient anyway... also, this solution screams "there must be a better way!" :-)

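import re

def f_sub(match):
    value = match.group()
    # a correctly escaped backslash (two literal backslashes) is kept as is
    if value == '\\\\':
        return value
    # anything else is a bad escape: replace it with a placeholder
    return '\x00'


# bad escape before "with", before "bad" and at the end of the line
line = 'text\\\\line \\with \\\\\\bad escapes\\'
line = re.sub(r'(\\\\|(\\([^\\]|$)))', f_sub, line)

print(repr(line))
# 'text\\\\line \x00ith \\\\\x00ad escapes\x00'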
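The same left-to-right idea can report positions instead of substituting. The following is a minimal sketch, not the poster's code: TOKEN and find_all_bad_escapes are hypothetical names, and 'EOL' is borrowed from the loop earlier in the question. The first alternative consumes valid pairs so they can never be mistaken for the start of a bad escape; only matches of the second alternative are reported.

```python
import re

# Alternative 1 consumes a valid escaped pair; alternative 2 captures a bad
# escape (group 1) and the character it tries to escape (group 2).
TOKEN = re.compile(r'\\\\|(\\)([^\\]|$)')

def find_all_bad_escapes(s):
    """Return a list of (position, escaped_char) for every bad escape in s."""
    return [
        (m.start(1), m.group(2) or 'EOL')
        for m in TOKEN.finditer(s)
        if m.group(1) is not None  # skip matches of the valid-pair alternative
    ]

line = 'text\\\\line \\with \\\\\\bad escapes\\'
bad = find_all_bad_escapes(line)
print(bad)
```

For the sample line this should report the backslashes before "with", before "bad", and at the end of the line.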


Could anybody help me with this?
Thanks a lot in advance!

anubhava · Accepted Answer

You may use this regex with lookarounds:

(?<!\\)(?:\\\\)*(\\)(?!\\)




Code:

>>> import re
>>> reg = re.compile(r'(?<!\\)(?:\\\\)*(\\)(?!\\)')
>>> def badEsc(s):
...     m = reg.search(s)
...     if m:
...         print("bad escape at position " + str(m.start(1)))
...     else:
...         print("ok")
...


Testing:

>>> badEsc(r'\foo')
bad escape at position 0
>>> badEsc(r'\\foo')
ok
>>> badEsc(r'\\\foo')
bad escape at position 2
>>> badEsc(r'\\\\foo')
ok
>>> badEsc(r'\\\\\foo')
bad escape at position 4
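The question also asks what the bad escape is trying to escape. One possible extension is to replace the trailing look-ahead with a second capture group. This is a sketch rather than part of the accepted answer: it assumes a lookaround pattern of the form "no backslash before, any number of valid pairs, then a lone backslash", and BAD_ESCAPE, find_bad_escape, and the 'EOL' marker are illustrative names.

```python
import re

# (?<!\\)    not preceded by a backslash, so pairs are counted from the run's start
# (?:\\\\)*  skip any number of valid escaped pairs
# (\\)       group 1: the lone backslash that starts the bad escape
# ([^\\]|$)  group 2: the character it tries to escape, or end of input
BAD_ESCAPE = re.compile(r'(?<!\\)(?:\\\\)*(\\)([^\\]|$)')

def find_bad_escape(s):
    """Return (position, escaped_char) for the first bad escape, or None."""
    m = BAD_ESCAPE.search(s)
    if m is None:
        return None
    return m.start(1), m.group(2) or 'EOL'

print(find_bad_escape(r'\\\foo'))
print(find_bad_escape(r'\\\\foo'))
```

Group 2 is the empty string when the backslash is the last character, which is why the sketch maps it to 'EOL'.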
