Tomasito665

Reputation: 1209

Identifying bad escape character with regex

Say that we have an escape character \ which is only allowed to immediately precede another \. In other words, the escape character \ is only allowed to escape itself: \\. Escaping any other character, or the end of the line, is considered to be a bad escape.

\foo      bad escape at position 0
\\foo     ok
\\\foo    bad escape at position 2
\\\\foo   ok
\\\\\foo  bad escape at position 4

I need to identify these bad escape characters, their position, and what they are trying to escape. We can assume that the input text does not contain newlines. Of course, I could iterate over groups of correct escapes until I find a bad one.

line = '\\\\\\'
i = 0

while i < len(line):
    curr_char = line[i]
    next_char = line[i+1] if i < len(line) - 1 else 'EOL'
    if curr_char == '\\':
        if next_char == '\\':
            i += 2
            continue
        else:
            print(f'bad escape at pos {i}: {next_char}')
            break
    else:
        i += 1

But I need something faster than this loop, which is why I would like to match the bad escape with a regular expression. My first, somewhat naive, approach was to match any backslash immediately followed by anything but a backslash (or by the end of the line): \\([^\\]|$).

import re
p = re.compile(r'\\([^\\]|$)')

p.search('\\')      # [ok] matches the only backslash
p.search('\\f')     # [ok] matches the only backslash
p.search('\\\\')    # [err] matches the correctly escaped backslash
p.search('\\\\\\')  # [ok] matches the last backslash, which indeed is a bad escape

Ok, so that doesn't work. The next logical step seems to be adding a negative look-behind expression (?<!\\) to ignore escaped backslashes.

import re
p = re.compile(r'(?<!\\)\\([^\\]|$)')

p.search('\\')      # [ok] matches the only backslash
p.search('\\f')     # [ok] matches the only backslash
p.search('\\\\')    # [ok] does not match anything
p.search('\\\\\\')  # [err] does not match the bad escape (last backslash)

Another option is to substitute each bad escape with a placeholder, but that seems rather hacky and not very efficient anyway. It also swallows the character after the bad escape, and it screams "there must be a better way!" :-)

import re

def f_sub(match):
    value = match.group()
    if value == '\\\\':
        return value
    return '\x00'


# bad escape before "with", before "bad" and at the end of the line
line = 'text\\\\line \\with \\\\\\bad escapes\\'
line = re.sub(r'(\\\\)|(\\([^\\]|$))', f_sub, line)

print(repr(line))
# 'text\\\\line \x00ith \\\\\x00ad escapes\x00'

Could anybody help me with this? Thanks a lot in advance!

Upvotes: 1

Views: 1573

Answers (1)

anubhava

Reputation: 785531

You may use this regex with lookarounds:

(?<!\\)(?:\\{2})*(\\)(?!\\)
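As a reading aid, here is the same pattern written with re.VERBOSE so each piece carries a comment. This is only a sketch of how I read the regex; the name bad_escape is just for illustration.

```python
import re

bad_escape = re.compile(r'''
    (?<!\\)      # not preceded by a backslash: we are at the start of a run
    (?:\\{2})*   # skip any number of valid pairs, each an escaped backslash
    (\\)         # group 1: one leftover backslash, i.e. the bad escape
    (?!\\)       # not followed by a backslash, so the run really ends here
''', re.VERBOSE)

m = bad_escape.search(r'\\\foo')
print(m.start(1))  # index of the lone backslash
```

Anchoring at the start of a run of backslashes is what lets the pattern tell an odd-length run (one backslash left over) from an even-length run (all pairs).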


Code:

>>> import re
>>> reg = re.compile(r'(?<!\\)(?:\\{2})*(\\)(?!\\)')
>>> def badEsc(s):
...     m = reg.search(s)
...     if m:
...         # group 1 is the lone backslash that escapes nothing valid
...         print("bad escape at position " + str(m.start(1)))
...     else:
...         print("ok")
...

Testing:

>>> badEsc(r'\foo')
bad escape at position 0
>>> badEsc(r'\\foo')
ok
>>> badEsc(r'\\\foo')
bad escape at position 2
>>> badEsc(r'\\\\foo')
ok
>>> badEsc(r'\\\\\foo')
bad escape at position 4
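
If you also need what each bad escape is trying to escape, and want every occurrence rather than just the first, something along these lines may work. It is a minimal sketch using the same regex with finditer; the helper name find_bad_escapes and the 'EOL' marker (borrowed from the question's loop) are my own choices, not anything built in.

```python
import re

BAD_ESCAPE = re.compile(r'(?<!\\)(?:\\{2})*(\\)(?!\\)')

def find_bad_escapes(text):
    """Return a list of (position, escaped_char) for every bad escape."""
    results = []
    for m in BAD_ESCAPE.finditer(text):
        pos = m.start(1)
        # The character right after the lone backslash, or 'EOL' at end of input
        escaped = text[pos + 1] if pos + 1 < len(text) else 'EOL'
        results.append((pos, escaped))
    return results

# The sample line from the question: bad escapes before "with",
# before "bad" and at the end of the line
line = 'text\\\\line \\with \\\\\\bad escapes\\'
for pos, ch in find_bad_escapes(line):
    print(f'bad escape at pos {pos}: {ch}')
```

Because each match ends on the bad backslash itself, the character it tries to escape is not consumed and the input is left untouched, unlike the substitution approach.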

Upvotes: 1
