i love stackoverflow

Reputation: 1685

Confused about regular expression

In the book Programming Collective Intelligence there is a regular expression,

splitter = re.compile('\\W*')

From context it looks like this matches any non-alphanumeric character. But I am confused, because to me it reads as a backslash followed by zero or more W's. What does it really match?

Upvotes: 4

Views: 304

Answers (6)

zhangyangyu

Reputation: 8610

\ is an escape character, both in Python string literals and in regexes. Reading from left to right, the \\ in the string literal becomes a single \, so the regex engine sees \W*, which matches zero or more characters that are not letters, digits, or underscores. If you wanted the regex to match a literal \, you would have to write \\\\ in the string literal. To make the regex clearer and simpler, you can use r'\W*'. The r prefix marks a raw string, which lets you write fewer backslashes.
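As a quick check of the point above, here is a minimal sketch comparing the two spellings at the string level, before the regex engine is involved. The variable names are only illustrative.

```python
# The ordinary literal and the raw literal build the same 3-character string.
plain = '\\W*'   # the escape sequence \\ collapses to one backslash
raw = r'\W*'     # raw string: the backslash is kept as written

print(plain == raw)   # True
print(len(plain))     # 3
print(list(plain))    # ['\\', 'W', '*']

# Four backslashes in an ordinary literal leave two in the actual string.
print(len('\\\\W*'))  # 4
```

Note that the printed list shows the backslash as '\\' only because that is how Python displays a single backslash in a repr.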

Upvotes: 1

Rohit Jain

Reputation: 213261

Your regex is equivalent to \W*. It matches 0 or more non-word characters, that is, anything other than a letter, a digit, or an underscore.

Actually, you are using a Python string literal instead of a raw string. In a Python string literal a backslash has a special meaning, so to get a literal backslash you need to escape it: \\. Then, for the regex to match a literal backslash, you need to escape both of those backslashes, which gives \\\\.

So, to match \ followed by 0 or more W, you would need \\\\W* in a string literal. You can simplify this by using a raw string, where \\ will match a literal \. That's because backslashes are not handled in any special way inside a raw string.

The below example will help you understand this:

>>> import re
>>> s = r"\WWWW$$$$"   # a backslash, four W's, four dollar signs

# Without raw string
>>> splitter = re.compile('\\W*')   # Match non-alphanumeric characters
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']

>>> splitter = re.compile('\\\\W*') # Match `\` followed by 0 or more `W`
>>> re.findall(splitter, s)
['\\WWWW']

# With raw string
>>> splitter = re.compile(r'\W*')   # Same as first one. You need a single `\`
>>> re.findall(splitter, s)
['\\', '', '', '', '', '$$$$', '']

>>> splitter = re.compile(r'\\W*')  # Same as 2nd. Two `\\` needed.
>>> re.findall(splitter, s)
['\\WWWW']
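The same distinction can also be checked with whole-string matches, which avoids the empty strings that findall reports. This is only a sketch; the names non_word, literal, symbols and backslash_ws are made up for illustration.

```python
import re

non_word = re.compile('\\W*')     # the regex engine sees \W*
literal = re.compile('\\\\W*')    # the regex engine sees \\W*

symbols = '$$$$'
backslash_ws = '\\WWW'            # a backslash followed by three W's

print(bool(non_word.fullmatch(symbols)))       # True
print(bool(non_word.fullmatch(backslash_ws)))  # False: W is a word character
print(bool(literal.fullmatch(backslash_ws)))   # True
print(bool(literal.fullmatch(symbols)))        # False: no leading backslash
```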

Upvotes: 3

jh314

Reputation: 27792

What happens is that \ is the escape character in a Python string literal, so \\ means a single \. After escaping, your regex becomes:

\W*

A better alternative is to use: r'\W*'

Upvotes: 0

Óscar López

Reputation: 236014

The first backslash is there just as an escape character, as needed in programming languages that don't have a good string representation of regular expressions (for example, Java). In Python you can do better; this is equivalent:

r'\W*'

Notice the r at the beginning (a raw string), which makes the first \ escape character unnecessary. The second \ is unavoidable, since it is part of the character class \W.

Upvotes: 2

NolanPower

Reputation: 409

This matches non-word characters, meaning anything other than letters, digits, or underscores. The string compiles into \W*, where \W is the negated version of \w, and \w matches any word character.

So you are correct in thinking that it matches non-alphanumeric characters.
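To see which characters fall on the \W side, here is a small sketch that tests each character of a made-up sample string on its own.

```python
import re

# Sample characters: a letter, an underscore, a capital, a digit,
# a space, a hyphen, a dollar sign, and a backslash.
sample = 'a_Z9 -$\\'

# Keep only the characters that \W matches.
non_word_chars = [ch for ch in sample if re.fullmatch(r'\W', ch)]
print(non_word_chars)  # [' ', '-', '$', '\\']
```

The underscore is left out of the result because \w treats it as a word character, along with letters and digits.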

For a reference on special regex characters, see http://www.regular-expressions.info/reference.html

Upvotes: 0

mattexx

Reputation: 6606

That regexp will match a backslash and zero or more W's. If you want to match zero or more non-word characters:

splitter = re.compile(r'\W*')

Upvotes: -1
