Reputation: 35
I would like to know why I get the same result with or without the string prefix "r" when looking for a period (full stop) using a Python regex.
After reading a number of sources (links below) multiple times and experimenting in code, only to get the same result each time (see below), I am still unsure why these behave identically:
>>> import re
>>> re.compile("\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").search("blah.").group()
'.'
>>> re.compile("\.").search("blah.").group()
'.'
Sources I have looked at:
Python docs: string literals http://docs.python.org/2/reference/lexical_analysis.html#string-literals
Regular expression to replace "escaped" characters with their originals
r prefix is for raw strings http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings
Upvotes: 3
Views: 892
Reputation: 1123560
The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.
Quoting from the Python Regular Expression HOWTO:
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
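To make the quoted point concrete, here is a minimal sketch (the variable names are illustrative and not taken from the HOWTO). It compares the two literals from the quote and then shows the classic case where raw notation saves escaping: a pattern that matches a single literal backslash.

```python
import re

normal = "\n"    # one character: a newline
raw = r"\n"      # two characters: a backslash followed by 'n'
lengths = (len(normal), len(raw))

# To match one literal backslash, the regex itself must be \\ (two characters).
# Without raw notation, each of those two backslashes has to be escaped again.
pattern_normal = "\\\\"
pattern_raw = r"\\"
same_pattern = pattern_normal == pattern_raw

# The subject string "a\\b" is the three characters a, backslash, b.
found = re.search(pattern_raw, "a\\b").group()

print(lengths, same_pattern, repr(found))
```

Both pattern literals hand the same two-character string to the re module; the raw form is simply easier to read.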
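The \. combination has no special meaning in regular Python strings, so there is no difference at all between the result of '\.' and r'\.'; you can use either (although recent Python 3 releases emit a warning for unrecognized escapes such as '\.' in a normal string, so the raw form is the safer habit):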
>>> len('\.')
2
>>> len(r'\.')
2
Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2
The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
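As a small illustration of the difference this makes in practice, the sketch below runs the same word-boundary search three ways. The sample text and variable names are made up for the example.

```python
import re

text = "cat concat cat."

# r'\b' reaches the regex engine as backslash + 'b': a word-boundary anchor.
raw_hits = re.findall(r"\bcat\b", text)

# '\b' is turned into a backspace character (\x08) before re ever sees it,
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
normal_hits = re.findall("\bcat\b", text)

# Doubling the backslash by hand produces the same pattern as the raw string.
escaped_hits = re.findall("\\bcat\\b", text)

print(raw_hits, normal_hits, escaped_hits)
```

The raw and hand-escaped patterns both find the two standalone occurrences of "cat", while the unescaped normal string silently matches nothing.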
The regular expression functions are passed string values, that is, the result of Python interpreting your string literal. The functions cannot tell whether you used raw or normal string literal syntax.
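A quick way to convince yourself of this is to compare the string values directly. In this sketch the normal literal is written as "\\." (escaped by hand) rather than "\." so that it runs without warnings on newer Python versions; the names are illustrative.

```python
import re

normal_literal = "\\."   # backslash escaped by hand
raw_literal = r"\."      # raw notation

# Both literals evaluate to the same two-character string: backslash, period.
identical = normal_literal == raw_literal

# Since re only ever receives the string value, both behave identically.
results = [re.sub(p, "!", "blah.") for p in (normal_literal, raw_literal)]

print(identical, results)
```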
Upvotes: 6