Tim

Reputation: 99576

metacharacters with slash inside brackets

From https://docs.python.org/2/howto/regex.html, I learned that the backslash is not necessary:

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.

But it also says:

(The predefined sets of characters) can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.

So I wonder how to understand the above two distinct statements, which seem to give contrary advice on when \ will work inside []? Thanks.

Upvotes: 1

Views: 413

Answers (2)

Martijn Pieters

Reputation: 1124858

\ will only work inside a character class if it introduces a predefined character set, unless you escape the backslash itself by doubling it.

But yes, the first statement glosses over this a little too easily.

Technically speaking, \s, \w, etc. are not meta-characters. They are pre-defined character class sets, so the definition still holds. Neither is the backslash; it defines the start of an escape sequence instead. The proper way to escape an escape sequence, even in character classes, is to double the backslash.

Note that an escape sequence that is not recognised contributes a single character to the re pattern: the ineffective backslash is ignored and only the next character is used. \C is not a known character class, so the pattern contains the character C at that point. (This is the Python 2 behaviour; from Python 3.6 on, an unknown escape made of \ and an ASCII letter raises an error instead.)

There are metacharacters that do consist of an escape sequence, such as \A, \Z, and \B; these are just regular A, Z, and B characters when used in a character class. \b is special; in a character class it is a backspace character, just like in Python string literals (analogous to how \n, \t, \r, \a and \v are interpreted the same as in string literals).

Demo:

>>> import re
>>> re.findall(r'[\\]', r'\ the backslash will match')
['\\']
>>> re.findall(r'[\C]', r'\C is not a valid escape sequence, only Cs will match')
['C', 'C']
>>> re.findall(r'[\s]', r'No s will match, whitespace is matched instead')
[' ', ' ', ' ', ' ', ' ', ' ', ' ']
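For readers on Python 3, the same rules can be checked with a short script. This is only a sketch: the sample strings and variable names are made up for illustration, and the last check assumes Python 3.6 or later, where an unknown letter escape such as \C is rejected rather than silently reduced to C.

```python
import re

# '$' is literal inside a class; '\s' still means "any whitespace".
dollar = re.findall(r'[akm$]', 'make $5')
punct = re.findall(r'[\s,.]', 'a, b.')

# A doubled backslash inside the class matches one literal backslash.
backslash = re.findall(r'[\\]', r'C:\temp')

# Inside a class, \b is the backspace character (0x08), not a word boundary.
# The subject string is deliberately not raw, so '\b' is a real backspace.
backspace = re.findall(r'[\b]', 'x\by')

# Assumption: Python 3.6+, where an unknown letter escape raises re.error.
try:
    re.compile(r'[\C]')
    unknown_escape_ok = True
except re.error:
    unknown_escape_ok = False

print(dollar, punct, backslash, backspace, unknown_escape_ok)
```

On older interpreters the last value may come out True, which matches the Python 2 transcript above.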

Upvotes: 3

zx81

Reputation: 41848

So I wonder how to understand the above two distinct statements

Here is how I understand these statements (and any two seemingly contradictory statements in documentation):

Never assume that documentation writers have perfect logic.
(Unlike the creatures of logic puzzles.)

For instance, whether or not a single \ should be considered a metacharacter (and I agree with you that it should), here is another example of a character that is not stripped of its special nature inside a Python character class: ]. In [abc]], the first ] has not lost its special nature. It closes the character class. In fact, it is the final ] that has lost its special nature: it sits outside the character class and simply matches a literal ]. On the other hand, in []abc], the first ] has lost its special nature. Of course there are different ways to argue that point too. :)
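A quick way to see the two ] cases in action is the sketch below. The test strings and variable names are invented for illustration; the third pattern adds the escaped form \], which is not discussed above but should work in any position.

```python
import re

# In [abc]] the first ] closes the class; the trailing ] is a plain literal,
# so the pattern means "a, b or c followed by ]".
closes_first = re.findall(r'[abc]]', 'a] b] c d]')

# In []abc] a ] placed first is a literal member of the class.
leading_literal = re.findall(r'[]abc]', 'a]d')

# Escaping the ] makes it a literal member wherever it appears.
escaped = re.findall(r'[a\]c]', 'a]d')

print(closes_first, leading_literal, escaped)
```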

Upvotes: 1
