Reputation: 191

Raw Strings, Python and re, Normal vs Special Characters

I'm encountering confusing and seemingly contradictory rules regarding raw strings. Consider the following example:

>>> text = 'm\n'
>>> match = re.search('m\n', text)
>>> print match.group()
m

>>> print text
m

This works, which is fine.

>>> text = 'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
m

>>> print text
m

Again, this works. But shouldn't this throw an error, because the raw string contains the characters m\n and the actual text contains a newline?

>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> print text
m\n

The above, surprisingly, throws an error, even though both are raw strings. This means both contain just the text m\n with no newlines.

>>> text = r'm\n'
>>> match = re.search(r'm\\n', text)
>>> print text
m\n
>>> print match.group()
m\n

The above works, surprisingly. Why do I have to escape the backslash in the re.search, but not in the text itself?

Then there's backslash with normal characters that have no special behavior:

>>> text = 'm\&'
>>> match = re.search('m\&', text)
>>> print text
m\&
>>> print match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

This doesn't match, even though both the pattern and the string lack special characters.

In this situation, no combination of raw strings works (text as a raw string, patterns as a raw string, both or none).

However, consider the last example. Escaping in the text variable, 'm\\&', doesn't work, but escaping in the pattern does. This parallels the behavior above--even stranger, I feel, considering that \& is of no special meaning to either Python or re:

>>> text = 'm\&'
>>> match = re.search(r'm\\&', text)
>>> print text
m\&
>>> print match.group()
m\&

My understanding of raw strings is that they inhibit the behavior of the backslash in Python. For regular expressions, this is important because it allows re.search to apply its own internal backslash behavior and prevents conflicts with Python. However, in situations like the above, where the backslash effectively means nothing, I'm not sure why it seems necessary. Worse yet, I don't understand why I need to escape the backslash in the pattern but not in the text, and why it doesn't seem to work when I make both raw strings.

The docs don't provide much guidance in this regard. They focus on examples with obvious problems, such as '\section', where \s is a meta-character. I'm looking for a complete answer to prevent unanticipated behavior such as this.

Upvotes: 4

Answers (4)

sirfz

Reputation: 4277

To explain it in simple terms, \<character> has a special meaning in regular expressions. For example \s for whitespace characters, \d for decimal digits, \n for new-line characters, etc.

When you define a string as

s = 'foo\n'

This string contains the characters f, o, o and the new-line character (length 4).

However, when defining a raw string:

s = r'foo\n'

This string contains the characters f, o, o, \ and n (length 5).

When you compile a regexp from the raw string r'\n' (a backslash followed by n), the re module interprets that sequence itself and matches newline characters. Similarly, a pattern containing the actual newline character (i.e. '\n') matches newline characters literally, just as a matches a.

Once you understand this concept, you should be able to figure out the rest.

To elaborate a bit further: to match the backslash character \ using regex, the valid regular expression is \\, which in Python is written r'\\' or, equivalently, '\\\\'.
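As a minimal sketch of the points above (the variable names plain, raw and backslash_match are only illustrative), the lengths and the backslash pattern can be checked directly:

```python
import re

plain = 'foo\n'    # escape processed by Python: f, o, o, newline
raw = r'foo\n'     # no escape processing: f, o, o, backslash, n

print(len(plain), len(raw))   # 4 5

# The regex \\ (two characters) matches one literal backslash.
# Both spellings below produce that same two-character pattern.
assert r'\\' == '\\\\'

backslash_match = re.search(r'\\', raw)
print(backslash_match.group() == '\\')   # True: one backslash was matched
print(re.search(r'\\', plain))           # None: plain contains no backslash
```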

Upvotes: 1

schesis

Reputation: 59128

In the regular Python string, 'm\n', the \n represents a single newline character, whereas in the raw string r'm\n' the \ and n are just themselves. So far, so simple.

If you pass the string 'm\n' as a pattern to re.search(), you're passing a two-character string (m followed by newline), and re will happily go and find instances of that two-character string for you.

If you pass the three-character string r'm\n', the re module itself will interpret the two characters \ n as having the special meaning "match a newline character", so that the whole pattern means "match an m followed by a newline", just as before.
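A short sketch of the two cases just described, assuming nothing beyond the standard re module (the names pattern_plain and pattern_raw are made up for illustration):

```python
import re

text = 'm\n'   # two characters: m, newline

pattern_plain = 'm\n'   # 2 characters; Python already produced the newline
pattern_raw = r'm\n'    # 3 characters; re translates \n into a newline

print(len(pattern_plain), len(pattern_raw))  # 2 3

match_plain = re.search(pattern_plain, text)
match_raw = re.search(pattern_raw, text)

# Both patterns find the same two-character substring.
print(repr(match_plain.group()))  # 'm\n'
print(repr(match_raw.group()))    # 'm\n'
```

The patterns differ as Python strings, but by the time matching happens they describe the same thing: an m followed by a newline.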

In your third example, since the string r'm\n' doesn't contain a newline, there's no match:

>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print(match)
None

With the pattern r'm\\n', you're passing two actual backslashes to re.search(), and again, the re module itself is interpreting the double backslash as "match a single backslash character".

In the case of 'm\&', something slightly different is going on. Python leaves the backslash in the string as a regular character, because \& isn't a recognized escape sequence. re, on the other hand, treats a backslash before a punctuation character as escaping that character, so \& simply means a literal & and the pattern is effectively m&. You can see that this is true by testing the pattern against 'm&':

>>> re.search('m\&', 'm&').group()
'm&'

As before, doubling the backslash tells re to search for an actual backslash character:

>>> re.search(r'm\\&', 'm\&').group()
'm\\&'

... and just to make things a little more confusing, the interactive prompt displays the result using repr, which shows the single backslash doubled. You can see that it's actually a single backslash by printing it:

>>> print(re.search(r'm\\&', 'm\&').group())
m\&
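If the goal is simply to match a piece of text literally, one option (not used in the examples above, so treat this as a suggestion rather than part of the explanation) is to let re.escape build the pattern instead of doubling backslashes by hand:

```python
import re

text = r'm\&'                 # three characters: m, backslash, &
pattern = re.escape(text)     # escapes characters that are special to re

literal_match = re.search(pattern, text)
print(literal_match.group() == text)  # True

# The same approach works for the earlier text containing backslash + n.
print(re.search(re.escape(r'm\n'), r'm\n') is not None)  # True
```

The exact string returned by re.escape can vary between Python versions, so the sketch only checks the match, not the pattern itself.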

Upvotes: 1

BPL

Reputation: 9863

To understand the internal representation of the strings you're confused about, I'd recommend using the repr and len built-in functions. They show exactly which characters each string contains, so pattern matching stops being confusing. For instance, say you want to analyze the strings you're having trouble with:

use_cases = [
    'm\n',
    r'm\n',
    'm\\n',
    r'm\\n',
    'm\&',
    r'm\&',
    'm\\&',
    r'm\\&',
]

for u in use_cases:
    print('-' * 10)
    print(u, repr(u), len(u))

The output would be:

----------
m
 'm\n' 2
----------
m\n 'm\\n' 3
----------
m\n 'm\\n' 3
----------
m\\n 'm\\\\n' 4
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\& 'm\\&' 3
----------
m\\& 'm\\\\&' 4

So you can see exactly the differences between normal/raw strings.
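The same kind of inspection can be extended to the matching itself. The following is a rough sketch, with a hypothetical results dictionary, that tries each pattern from the question against each text:

```python
import re

texts = ['m\n', r'm\n']               # m + newline, m + backslash + n
patterns = ['m\n', r'm\n', r'm\\n']   # literal newline, regex \n, regex \\ then n

results = {}
for pattern in patterns:
    for text in texts:
        matched = re.search(pattern, text) is not None
        results[(pattern, text)] = matched
        print(repr(pattern), repr(text), matched)

# 'm\n' and r'm\n' as patterns both match only the text with a real newline;
# r'm\\n' matches only the text containing a backslash followed by n.
```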

Upvotes: 0

vks

Reputation: 67968

text = r'm\n'
match = re.search(r'm\\n', text)

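In the first line, the r prefix stops Python from interpreting \n as a single newline character.

In the second line, the r prefix plays the same role. The doubled backslash, \\, prevents the regex engine from interpreting \n as a newline. Regex also uses \ for its own sequences, such as \s and \d.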

The following characters are the meta characters that give special meaning to the regular expression search syntax:

\ the backslash escape character. The backslash gives special meaning to the character following it. For example, the combination "\n" stands for the newline, one of the control characters. The combination "\w" stands for a "word" character, one of the convenience escape sequences while "\1" is one of the substitution special characters. Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, inclusive the newline character itself. Example: "a\+" matches "a+" and not a series of one or more "a"s.

Upvotes: 0
