Reputation: 99576
From https://docs.python.org/2/howto/regex.html, I learned that the backslash is not necessary:
Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
But it also says:
(The predefined sets of characters) can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
So how should I understand these two statements, which seem to give contradictory advice on when \ will work inside []? Thanks.
Upvotes: 1
Views: 413
Reputation: 1124858
\ only has an effect inside a character class when it introduces a predefined character set such as \s. To match a literal backslash, escape it by doubling it.
But yes, the first statement glosses over this a little too easily.
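The two HOWTO statements can be checked side by side. This is a minimal sketch; the sample strings are made up for illustration:

```python
import re

# '$' loses its end-of-line meaning inside a class and matches a literal '$'.
dollar_hits = re.findall(r'[akm$]', 'make $5')
print(dollar_hits)   # ['m', 'a', 'k', '$']

# '\s' keeps its meaning inside a class, so whitespace, ',' and '.' all match.
class_hits = re.findall(r'[\s,.]', 'a, b.c')
print(class_hits)    # [',', ' ', '.']
```

Both statements hold at once: the bare metacharacter becomes a literal, while the backslash sequence is still interpreted.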
Technically speaking, \s, \w, etc. are not metacharacters. They are predefined character sets, so the definition still holds. The backslash is not a metacharacter either; it marks the start of an escape sequence. The proper way to escape an escape sequence, even in character classes, is to double the backslash.
Note that, in Python 2, any escape sequence that is not recognised contributes a single character to the re pattern: the ineffective backslash is ignored and only the next character is used. \C is not a known character class, so the pattern contains the character C at that point. (Since Python 3.6, an unknown escape made of a backslash and an ASCII letter raises re.error instead.)
There are metacharacters that consist of an escape sequence, such as \A, \Z, and \B; used in a character class in Python 2, these are just the regular characters A, Z, and B. \b is special: in a character class it is the backspace character, just like in Python string literals (analogous to how \n, \t, \r, \a and \v are interpreted the same as in string literals).
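The double life of \b can be seen directly. A small sketch, assuming a made-up sample string that contains a tab and a backspace character:

```python
import re

# Non-raw literal on purpose: \t is a real tab, \x08 is a real backspace.
text = 'tab\there, backspace\x08here'

# Inside a class, \b means the backspace character (0x08).
backspaces = re.findall(r'[\b]', text)
print(backspaces)   # ['\x08']

# \t inside a class is read the same way as in a string literal: a tab.
tabs = re.findall(r'[\t]', text)
print(tabs)         # ['\t']

# Outside a class, \b is the zero-width word-boundary assertion.
words = re.findall(r'\bhere\b', text)
print(words)        # ['here', 'here']
```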
Demo:
>>> import re
>>> re.findall(r'[\\]', r'\ the backslash will match')
['\\']
>>> re.findall(r'[\C]', r'\C is not a valid escape sequence, only Cs will match')
['C', 'C']
>>> re.findall(r'[\s]', r'No s will match, whitespace is matched instead')
[' ', ' ', ' ', ' ', ' ', ' ', ' ']
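Whether the [\C] line of the demo runs at all depends on the interpreter version, so it may help to probe it rather than assume. In this sketch, probe_unknown_escape is a hypothetical helper name, not part of the re module:

```python
import re

def probe_unknown_escape(pattern=r'[\C]'):
    """Hypothetical helper: return the matches, or a note if re rejects the pattern."""
    try:
        return re.findall(pattern, r'\C')
    except re.error as exc:
        # Newer Python 3 releases refuse unknown letter escapes outright.
        return 'rejected: %s' % exc

result = probe_unknown_escape()
print(result)

# The doubled backslash is accepted everywhere and matches one literal backslash.
literal = re.findall(r'[\\]', r'\C')
print(literal)   # ['\\']
```

Doubling the backslash is the portable spelling; relying on an unknown escape being silently reduced to its letter is not.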
Upvotes: 3
Reputation: 41848
So I wonder how to understand the above two distinct statements
Here is how I understand these statements (and any two seemingly contradictory statements in documentation):
Never assume that documentation writers have perfect logic.
(Unlike the creatures of logic puzzles.)
For instance, whether or not a single \ should be considered a metacharacter (and I agree with you that it should), here is another example of a character that is not stripped of its special nature inside a Python character class: ]. In [abc]], the first ] has not lost its special nature: it closes the character class. In fact, it is the final ] that has lost its special nature, because it sits outside the character class and simply matches a literal ]. On the other hand, in []abc], the first ] has lost its special nature. Of course there are different ways to argue that point too. :)
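Both readings of ] can be tried out. A minimal sketch with a made-up sample string:

```python
import re

sample = 'a] b] ]x'

# In [abc]] the first ']' closes the class and the trailing ']' is a literal,
# so the pattern means "one of a, b, c, followed by ']'".
closed_early = re.findall(r'[abc]]', sample)
print(closed_early)   # ['a]', 'b]']

# In []abc] a ']' placed first is taken literally, so the class
# contains ']', 'a', 'b' and 'c'.
leading_bracket = re.findall(r'[]abc]', sample)
print(leading_bracket)   # ['a', ']', 'b', ']', ']']
```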
Upvotes: 1