Deqing

Reputation: 14632

Does * have a side effect in Python regular expression matching?

I'm learning Python's regular expressions. The following works as I expected:

>>> import re
>>> re.split('\s+|:', 'find   a str:s2')
['find', 'a', 'str', 's2']

But when I change + to *, the output looks strange to me:

>>> re.split('\s*|:', 'find  a str:s2')
['find', 'a', 'str:s2']

How is such a pattern interpreted in Python?

Upvotes: 4

Views: 127

Answers (2)

Martijn Pieters

Reputation: 1121614

The 'side effect' you are seeing is that re.split() will only split on matches that are longer than 0 characters.

The \s*|: pattern matches either zero or more whitespace characters or a :, trying the alternatives from left to right. But \s* can match zero characters, so it succeeds at every position. The split is made only at those locations where one or more whitespace characters matched.

Because \s* succeeds at every position that is considered for splitting, the second alternative, :, is never tried.
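To make the left-to-right behaviour visible, here is a small sketch that asks each pattern what it matches exactly at the colon. The sample string and the variable names are only for illustration; the optional pos argument of a compiled pattern's match() method is used to start matching at that position.

```python
import re

text = 'str:s2'
colon_at = text.index(':')  # position of the colon in the sample string

# \s* is tried first and succeeds with an empty match, so ':' is never tried.
spaces_first = re.compile(r'\s*|:').match(text, colon_at)

# With ':' listed first, the colon itself is matched.
colon_first = re.compile(r':|\s*').match(text, colon_at)

print(repr(spaces_first.group()))  # ''
print(repr(colon_first.group()))   # ':'
```

The empty match from the first pattern is why the colon was never used as a separator in the question's output.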

Splitting only on non-empty matches is called out explicitly in the re.split() documentation (this applies to Python versions before 3.7; later versions also split on empty matches):

Note that split will never split a string on an empty pattern match.

If you reverse the alternatives, : is tried first, so it does match:

>>> re.split(':|\s*', 'find  a str:s2')
['find', 'a', 'str', 's2']
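If relying on the order of the alternatives feels fragile, one option is a pattern that cannot match the empty string at all. This is a minimal sketch rather than the only way to do it; the variable names are illustrative, and the character-class variant is an alternative spelling that additionally treats a run of mixed whitespace and colons as one separator.

```python
import re

text = 'find  a str:s2'

# Neither alternative can match the empty string, so their order is irrelevant.
plus_then_colon = re.split(r'\s+|:', text)
colon_then_plus = re.split(r':|\s+', text)

# A character class with + treats any run of whitespace or colons as a separator.
char_class = re.split(r'[\s:]+', text)

print(plus_then_colon)  # ['find', 'a', 'str', 's2']
print(colon_then_plus)  # ['find', 'a', 'str', 's2']
print(char_class)       # ['find', 'a', 'str', 's2']
```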

Upvotes: 9

nochkin

Reputation: 710

If you meant to do "or" for your matching, then you have to do something like this:

re.split('(\s*|:)', 'find a str:s2')

In short: "+" means "at least one character", while "*" means any number (or none).

Upvotes: -4
