Reputation: 1685
In the book Programming Collective Intelligence there is a regular expression,
splitter = re.compile('\\W*')
From context it looks like this matches any non-alphanumeric character. But I am confused because it seems like it matches a backslash, then one or more W's. What does it really match?
Upvotes: 4
Views: 304
Reputation: 8610
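\ is an escape character, both in Python string literals and in regexes. Reading from left to right, the \\ in the string literal stands for a single \, which is followed by W*, so the regex is \W*: it matches zero or more non-word characters (anything other than letters, digits, and the underscore). If you want the regex to match a literal \, you have to write \\\\ in a regular string literal. To make the regex clearer and simpler, you can use r'\W*'. The r prefix marks a raw string, which lets you write fewer backslashes.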
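A minimal sketch of the two spellings, to show that the regular and raw literals produce the same three-character string before the regex engine ever sees it. The variable names plain and raw are just illustrative:

```python
# Regular string literal: Python turns \\ into one backslash.
plain = '\\W*'
# Raw string literal: backslashes are kept exactly as written.
raw = r'\W*'

print(plain == raw)   # True
print(len(plain))     # 3
print(list(plain))    # ['\\', 'W', '*']

# A regex that matches one literal backslash needs two backslashes,
# which takes four in a regular literal or two in a raw one.
print('\\\\' == r'\\')  # True
```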
Upvotes: 1
Reputation: 213261
Your regex is equivalent to \W*. It matches 0 or more non-word characters (anything except letters, digits, and the underscore).
You are using a regular Python string literal rather than a raw string. In a regular string literal a backslash has a special meaning, so a literal backslash has to be written as \\. A backslash is special in a regex as well, so to match a literal backslash the regex itself needs \\, and each of those two backslashes must be escaped again in the string literal, which gives \\\\.
So, to match a \ followed by 0 or more W characters, you would need \\\\W* in a regular string literal. You can simplify this by using a raw string, where \\ matches a literal \. That is because Python does not treat backslashes as escape characters inside a raw string, so they reach the regex engine unchanged.
The below example will help you understand this:
>>> import re
>>> s = "\\WWWW$$$$"  # one backslash, four W's, four $'s
# Without raw string
>>> splitter = re.compile('\\W*') # Match non-word characters
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']
>>> splitter = re.compile('\\\\W*') # Match `\` followed by 0 or more `W`
>>> re.findall(splitter, s)
['\\WWWW']
# With raw string
>>> splitter = re.compile(r'\W*') # Same as first one. You need a single `\`
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']
>>> splitter = re.compile(r'\\W*') # Same as 2nd. Two `\\` needed.
>>> re.findall(splitter, s)
['\\WWWW']
Upvotes: 3
Reputation: 27792
What happens is that the \ is used to escape characters, so \\ means a single \. So your regex becomes (after escaping):
\W*
A better alternative is to use: r'\W*'
Upvotes: 0
Reputation: 236014
The first backslash is there just as an escape character, as required in programming languages that don't have a good string representation of regular expressions (for example, Java). In Python you can do better; this is equivalent:
r'\W*'
Notice the r at the beginning (a raw string), which makes the first \ escape character unnecessary. The second \ is unavoidable; it is part of the character class \W.
Upvotes: 2
Reputation: 409
This matches non-word characters, meaning anything other than letters, digits, or underscores. The string compiles into \W, which is the negated version of \w, where \w matches any word character.
So you are correct in thinking that it matches non-alphanumeric characters.
For a reference on special regex characters, see http://www.regular-expressions.info/reference.html
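To make the word / non-word distinction concrete, here is a small sketch that checks a handful of sample characters against \W. The sample list is arbitrary and chosen only for illustration:

```python
import re

non_word = re.compile(r'\W')

# Letters, digits, and the underscore are word characters;
# everything else in this list is a non-word character.
samples = ['a', 'Z', '7', '_', ' ', '$', '\\', '-']
matched = [ch for ch in samples if non_word.fullmatch(ch)]

print(matched)  # [' ', '$', '\\', '-']
```

Note that the underscore is not matched, so \W is slightly narrower than "non-alphanumeric".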
Upvotes: 0
Reputation: 6606
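That regexp would match a backslash and zero or more W's only if it were written as a raw string, r'\\W*'. As a regular string literal, '\\W*' already matches zero or more non-word characters, but the raw-string form is clearer: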
splitter = re.compile(r'\W*')
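As a hedged sketch of how a splitter like this might be used to break text into words: the example below swaps in \W+ (one or more) for \W*, which is an assumption on my part and not the book's pattern. The reason is that since Python 3.7, re.split also splits on empty matches, so \W* would split between every character. The sample text is made up.

```python
import re

# Assumption: \W+ instead of the book's \W*, so the pattern never
# matches the empty string and splitting behaves the same across versions.
splitter = re.compile(r'\W+')

text = "Hello, world! foo_bar 42"
# Drop empty strings that appear when the text starts or ends with a separator.
words = [w.lower() for w in splitter.split(text) if w]

print(words)  # ['hello', 'world', 'foo_bar', '42']
```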
Upvotes: -1