helveticafire

Reputation: 35

Python - Should I be using string prefix r when looking for a period (full stop or .) using regex?

I would like to know why I get the same result with or without the string prefix "r" when looking for a period (full stop) using a Python regex.

After reading a number of sources (links below) several times and experimenting in code, only to get the same result each time (again, see below), I am still unsure of:

  1. What is the difference between using the string prefix "r" and not using it when looking for a period with a regex?
  2. Which is considered the correct way to find a period in a string with a Python regex: with the string prefix "r" or without it?

>>> import re
>>> re.compile("\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").sub("!", "blah.")
'blah!'
>>> re.compile(r"\.").search("blah.").group()
'.'
>>> re.compile("\.").search("blah.").group()
'.'

Sources I have looked at:

Python docs: string literals http://docs.python.org/2/reference/lexical_analysis.html#string-literals

Regular expression to replace "escaped" characters with their originals

Python regex - r prefix

r prefix is for raw strings http://forums.udacity.com/questions/7000217/r-prefix-is-for-raw-strings

Upvotes: 3

Views: 892

Answers (1)

Martijn Pieters

Reputation: 1123560

The raw string notation is just that, a notation to specify a string value. The notation results in different string values when it comes to backslash escapes recognized by the normal string notation. Because regular expressions also attach meaning to the backslash character, raw string notation is quite handy as it avoids having to use excessive escaping.

Quoting from the Python Regular Expression HOWTO:

The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
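The distinction in that quote can be checked directly. This is only a minimal sketch; the variable names `normal` and `raw` are illustrative and not from the HOWTO:

```python
# Two literals with the same source characters between the quotes.
normal = "\n"    # escape is processed: one newline character
raw = r"\n"      # no escape processing: a backslash followed by 'n'

print(len(normal))     # 1
print(len(raw))        # 2
print(raw == "\\n")    # True: the raw form equals the doubled-backslash form
```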

The \. combination has no special meaning in regular Python strings, so there is no difference at all between the values of '\.' and r'\.'; you can use either:

>>> len('\.')
2
>>> len(r'\.')
2

Raw strings only make a difference when the backslash + other characters do have special meaning in regular string notation:

>>> '\b'
'\x08'
>>> r'\b'
'\\b'
>>> len('\b')
1
>>> len(r'\b')
2

The \b combination has special meaning; in a regular string it is interpreted as the backspace character. But regular expressions see \b as a word boundary anchor, so you'd have to use \\b in your Python string every time you wanted to use this in a regular expression. Using r'\b' instead makes it much easier to read and write your expressions.
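To see the effect on actual matching, here is a small sketch; the sample text and variable names are made up for illustration:

```python
import re

text = "a cat sat"

# Normal literal: Python turns \b into a backspace (\x08) before re sees it,
# so the pattern searches for backspace characters that are not in the text.
no_match = re.search("\bcat\b", text)

# Doubled backslashes and a raw literal both hand re the two characters
# backslash and 'b', which the regex engine reads as a word boundary.
escaped = re.search("\\bcat\\b", text)
raw = re.search(r"\bcat\b", text)

print(no_match)          # None
print(escaped.group())   # cat
print(raw.group())       # cat
```

The escaped and raw spellings behave identically; the raw one is simply easier to read.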

The regular expression functions are passed string values, that is, the result of Python interpreting your string literal. The functions cannot tell whether you used raw or normal string literal syntax.
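A quick way to convince yourself of this is to compare what the `re` module actually receives. The sketch below spells the non-raw pattern as `"\\."` rather than `"\."`, because newer Python versions (3.6 and later) warn about unrecognized escapes such as `\.` in normal literals; the variable names are illustrative:

```python
import re

# Two spellings of the same two-character string: a backslash and a period.
escaped = "\\."
raw = r"\."
print(escaped == raw)    # True

# The compiled patterns store that same string; nothing records
# which literal syntax produced it.
print(re.compile(escaped).pattern == re.compile(raw).pattern)    # True

# Both therefore behave identically when used.
print(re.sub(escaped, "!", "blah."))    # blah!
print(re.sub(raw, "!", "blah."))        # blah!
```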

Upvotes: 6
