Tanguy

Reputation: 3304

Understanding raw strings for regular expressions in Python

I have lots of text files full of newlines which I am parsing in Python 3.4. I am looking for the newlines because they separate my text into different parts. Here is an example of such a text:

text = 'avocat  ;\n\n       m. x'

I naïvely started looking for newlines with '\n' in my regular expression (RE), without thinking that the backslash '\' is an escape character. However, this turned out to work fine:

>>> import re

>>> pattern1 = '\n\n'
>>> re.findall(pattern1, text)
['\n\n']

Then I understood that I should be using a double backslash in order to look for one backslash. This also worked fine:

>>> pattern2 = '\\n\\n'
>>> re.findall(pattern2, text)
['\n\n']

On another thread, however, I was told to use raw strings instead of regular strings, and this format fails to find the newlines I am looking for:

>>> pattern3 = r'\\n\\n'
>>> pattern3
'\\\\n\\\\n'
>>> re.findall(pattern3, text)
[]

Could you please help me out here? I am getting a little confused about what kind of RE I should be using in order to correctly match the newlines.

Upvotes: 2

Views: 980

Answers (2)

Tanguy

Reputation: 3304

OK I got it. In this nice Python regex cheat sheet it says: "Special character escapes are much like those already escaped in Python string literals. Hence regex '\n' is same as regex '\\n'".

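This is why pattern1 and pattern2 both matched my text in my previous example: pattern1 hands the regex engine two actual newline characters, while pattern2 hands it the two-character escape \n twice, which the engine itself interprets as a newline. pattern3, however, is the regex \\n\\n, which looks for a literal backslash followed by the letter n, twice. My text contains no literal backslashes, so nothing matches. That is also why its canonical string representation is '\\\\n\\\\n'.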
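To make the difference concrete, here is a small sketch that reuses the text and the three patterns from the question and compares their lengths and matches. The variable names lengths, matches and literal are only illustrative, and the second test string is an assumption added to show what pattern3 does match.

```python
import re

text = 'avocat  ;\n\n       m. x'

pattern1 = '\n\n'      # two actual newline characters
pattern2 = '\\n\\n'    # backslash, n, backslash, n (regex escape for newline, twice)
pattern3 = r'\\n\\n'   # backslash, backslash, n, twice (regex for a literal backslash + n)

patterns = (pattern1, pattern2, pattern3)

# Number of characters the regex engine actually receives for each pattern.
lengths = [len(p) for p in patterns]

# Matches against the original text, which contains real newlines only.
matches = [re.findall(p, text) for p in patterns]

# Hypothetical text containing literal backslashes: here pattern3 does match.
literal = re.findall(pattern3, 'a\\n\\nb')

print(lengths)
print(matches)
print(literal)
```

Comparing the lengths shows that the three literals are three different strings, even though the first two end up matching the same thing.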

Upvotes: 2

Assem

Reputation: 12107

Don't double the backslash when using a raw string:

>>> pattern3 = r'\n\n'
>>> pattern3
'\\n\\n'
>>> re.findall(pattern3, text)
['\n\n']
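As a rough illustration of why raw strings are the usual advice, the sketch below first checks that r'\n\n' is the same string as '\\n\\n', then shows a case where the raw prefix changes the result. The \b example is my own addition, not part of the question: in a regular literal Python turns '\b' into a backspace character, whereas in a raw literal it reaches the regex engine as a word boundary.

```python
import re

text = 'avocat  ;\n\n       m. x'

# A raw string keeps the backslash, so these two literals are identical.
same = (r'\n\n' == '\\n\\n')

# Raw literal: the regex engine sees \b and treats it as a word boundary.
words_raw = re.findall(r'\bm\b', text)

# Regular literal: Python converts '\b' to a backspace before re sees it,
# so the pattern looks for backspace, 'm', backspace.
words_plain = re.findall('\bm\b', text)

print(same, words_raw, words_plain)
```

For '\n' the prefix makes no visible difference, which is why pattern1 worked, but for escapes that mean different things to Python and to the regex engine, the raw form is the safer habit.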

Upvotes: 5
