Regular expression gets really confusing when involving "r" and "\"

Question

To better understand how r and \ work, I did some testing in Python's IDLE. The results are as follows:

    >>> 11111111111111111111111111111111111111111
    11111111111111111111111111111111111111111
    >>> re.sub(r'\?', r'\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub('\?', r'\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub('\\?', r'\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub(r'\\?', r'\?', '?ttt?')
    '\\??\\?t\\?t\\?t\\??\\?'
    >>> 222222222222222222222222222222222222222
    222222222222222222222222222222222222222
    >>> re.sub(r'\?', r'\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub(r'\?', '\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub(r'\?', '\\?', '?ttt?')
    '\\?ttt\\?'
    >>> re.sub(r'\?', r'\\?', '?ttt?')
    '\\?ttt\\?'
    >>> 333333333333333333333333
    333333333333333333333333
    >>> re.sub(r'(\?)', r'\1\1', '?ttt?')
    '??ttt??'
    >>> re.sub(r'(\?)', '\1\1', '?ttt?')
    '\x01\x01ttt\x01\x01'
    >>>

For the first argument of re.sub():

How come r'\?', '\?' and '\\?' work the same, while r'\\?' works differently?
And why does re.sub(r'\\?', r'\?', '?ttt?') give such a weird result?

For the second argument of re.sub():

Why is it that r'\?', '\?', '\\?' and r'\\?' all work the same?
When digits are involved, the r is almost always necessary: r'\1\1' and '\1\1' give different results. But I don't see why '\1' is the same as '\x01'.

anjsimmo · Accepted Answer

The first thing to understand is how Python treats escape sequences:

The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character

For example, in '\\?', the first backslash escapes the second backslash to produce just \?.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences only recognized in string literals fall into the category of unrecognized escapes for bytes literals.

So in '\?', because \? isn't a valid escape sequence, Python leaves it as just \?. However, style checkers like Pylint will flag an "Anomalous backslash in string" warning.

When working with strings that contain lots of backslashes, you can use Python's raw string syntax to avoid the need to escape them:

>>> r'\?' == '\?' == '\\?' # These are all equivalent
True
>>> r'\?' == r'\\?'        # But r'\\?' is different
False
>>> r'\\?' == '\\\\?'
True

Numbers also have a special meaning when escaped (they are treated as octal):

>>> '\141' == 'a'
True

The second thing to understand is how Python treats the backslash in regular expressions:

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

For example, in re.sub(r'\?', r'\?', '?ttt?'), the first argument r'\?' tells the regular expression library that we want to search for the ? character rather than treating ? as a qualifier. If we didn't use raw strings, we would need to write re.sub('\\?', '\\?', '?ttt?').

In contrast, in re.sub(r'\\?', r'\?', '?ttt?'), the first argument r'\\?' tells the regular expression library that we want to search for a \ character, and ? is a special regex qualifier that treats the previous character as optional (in this case, causing the pattern to always match, because it can match the empty string at every position). If we didn't use raw strings, we would need to write re.sub('\\\\?', '\\?', '?ttt?').
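
To see where the extra replacements in the "weird" result come from, here is a rough sketch that lists what each pattern actually matches. The `#` marker and the variable names are just for illustration:

```python
import re

text = '?ttt?'

# r'\?' matches a literal question mark: two matches in this text.
literal_matches = re.findall(r'\?', text)

# r'\\?' matches an optional backslash. The text contains no backslash,
# so the pattern matches the empty string at every position (0 through 5).
optional_matches = re.findall(r'\\?', text)

# Replacing each (empty) match with a marker shows where they occur.
marked = re.sub(r'\\?', '#', text)

print(literal_matches)   # ['?', '?']
print(optional_matches)  # ['', '', '', '', '', '']
print(marked)            # #?#t#t#t#?#
```

Each `#` sits where the original session inserted `\?`, which is why the replacement appears before, between and after every character.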

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
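
A minimal sketch of the same point, using a made-up path string as the text to search:

```python
import re

path = 'C:\\temp'          # the text C:\temp, which contains one backslash
assert len(path) == 7

raw_pattern = r'\\'        # regex \\ : matches one literal backslash
plain_pattern = '\\\\'     # the same regex written without a raw string
assert raw_pattern == plain_pattern

match = re.search(raw_pattern, path)
assert match is not None
assert match.start() == 2  # the backslash comes right after 'C:'
```

Both spellings hand the same two-character pattern to the regular expression library; the raw string just needs half as many backslashes in the source code.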

Yes, regular expressions with raw strings and \ get really confusing, but it's even more confusing without raw strings!

Additional details: The way backslashes are treated in the second argument to re.sub is a bit different to the first argument. The regular expression library still needs to provide a way to escape backslashes, else there would be no way to distinguish between backreferences such as r'\1' and literally wanting to replace matches with a backslash followed by a one. However, other regular expression syntax such as ? doesn't need to be escaped, because it has no special meaning in the second argument. According to the documentation for the second argument of re.sub:

... any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

So the regular expression library treats re.sub(r'\?', r'\?', '?ttt?') the same as re.sub(r'\?', r'\\?', '?ttt?'), because \? is an unknown regular expression escape sequence for the second argument (even though it is a recognised regular expression escape sequence for the first argument). This is kind of similar to how Python treats '\?' the same as '\\?', but is to do with how the regular expression library chooses to treat backslashes rather than anything to do with the Python syntax itself.
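
As a short sketch of how the replacement argument is processed (only raw strings are used here, so Python's own escape handling stays out of the picture; the variable names are illustrative):

```python
import re

text = '?ttt?'

# In the replacement, \? is an unknown escape and is left alone, while \\
# is an escaped backslash. Both produce a backslash followed by ?.
single = re.sub(r'\?', r'\?', text)
double = re.sub(r'\?', r'\\?', text)
assert single == double == r'\?ttt\?'

# \1 is a backreference to group 1; \\1 is a literal backslash and a 1.
backref = re.sub(r'(\?)', r'\1\1', text)
literal = re.sub(r'(\?)', r'\\1', text)
assert backref == '??ttt??'
assert literal == r'\1ttt\1'
```

The last two calls show the case the library has to disambiguate: without the escaped form `\\1`, there would be no way to ask for a literal backslash followed by a one.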
