Reputation: 14632
I'm learning Python's regular expressions. The following works as I expected:
>>> import re
>>> re.split(r'\s+|:', 'find a str:s2')
['find', 'a', 'str', 's2']
But when I change + to *, the output is weird to me:
>>> re.split(r'\s*|:', 'find a str:s2')
['find', 'a', 'str:s2']
How is such a pattern interpreted in Python?
Upvotes: 4
Views: 127
Reputation: 1121614
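The 'side effect' you are seeing is that re.split() will only split on matches that are longer than 0 characters. (This describes Python versions before 3.7; from 3.7 onward, re.split() also splits on empty matches.)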
The \s*|: pattern matches either zero or more whitespace characters, or a :, whichever comes first. But zero whitespace characters match everywhere. The split is made only in those locations where more than zero whitespace characters matched.
Because the \s* alternative succeeds (possibly with an empty match) every time a position is considered for splitting, the next alternative, :, is never tried.
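To see the ordering of alternatives in isolation, here is a minimal sketch that anchors a single match attempt at the colon's position using the compiled pattern's match(string, pos) method. The sample string 'str:s2' and the variable names are only for illustration.

```python
import re

text = 'str:s2'
colon_at = text.index(':')  # index 3

star_first = re.compile(r'\s*|:')
colon_first = re.compile(r':|\s*')

# Try each pattern exactly at the colon's position.
m1 = star_first.match(text, colon_at)
m2 = colon_first.match(text, colon_at)

# \s* succeeds with zero characters, so ':' is not tried.
print(repr(m1.group()), m1.span())  # '' (3, 3)

# With ':' listed first, it is tried first and matches.
print(repr(m2.group()), m2.span())  # ':' (3, 4)
```

The regex engine takes the first alternative that succeeds at a given position, and \s* always succeeds, if only with an empty match.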
Splitting on non-empty matches is called out explicitly in the re.split() documentation:
Note that split will never split a string on an empty pattern match.
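That documentation note applies to older interpreters. As far as I know, Python 3.7 changed re.split() so that empty matches also split the string, which means the outputs in this thread may differ on a newer version. A small sketch to check what your own interpreter does (the test string 'ab' is arbitrary, and a FutureWarning may appear on 3.5 or 3.6):

```python
import re
import sys

# 'ab' contains no whitespace, so \s* can only match empty strings here.
parts = re.split(r'\s*', 'ab')

# More than one piece means this interpreter split on an empty match.
splits_on_empty = len(parts) > 1

print(sys.version_info[:2], parts, splits_on_empty)
```

If splits_on_empty is True, expect patterns that can match the empty string to split between individual characters.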
If you reverse the pattern, : is considered, as it is the first choice:
>>> re.split(r':|\s*', 'find a str:s2')
['find', 'a', 'str', 's2']
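If the goal is simply "split on whitespace or colons", one option that avoids empty matches altogether is a character class with +. This is a suggested alternative, not something from the question; the extra 'a: b' input is made up to show how it differs from \s+|: when separators are adjacent.

```python
import re

text = 'find a str:s2'

# One or more whitespace characters or colons form a single separator.
by_class = re.split(r'[\s:]+', text)
print(by_class)  # ['find', 'a', 'str', 's2']

# With adjacent separators the two patterns behave differently:
alt_parts = re.split(r'\s+|:', 'a: b')     # ':' and ' ' are separate matches
class_parts = re.split(r'[\s:]+', 'a: b')  # ': ' is one match
print(alt_parts)    # ['a', '', 'b']
print(class_parts)  # ['a', 'b']
```

Neither pattern can match the empty string, so the result should not depend on how the interpreter treats empty matches.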
Upvotes: 9
Reputation: 710
If you meant to do "or" for your matching, then you have to do something like this:
re.split(r'(\s*|:)', 'find a str:s2')
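In short:
"+" means "one or more repetitions of the preceding element".
"*" means "zero or more repetitions of the preceding element" (so it can also match nothing).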
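A short sketch of the difference between the two quantifiers, plus what the parentheses do in re.split(). The grouped example uses \s+ rather than \s* so that it cannot match the empty string; that substitution is mine, made to keep the output independent of the Python version.

```python
import re

# '+' needs at least one whitespace character; 'abc' starts with none.
plus = re.match(r'\s+', 'abc')
print(plus)  # None

# '*' is satisfied by zero characters, so it matches the empty string.
star = re.match(r'\s*', 'abc')
print(repr(star.group()))  # ''

# Parentheses create a capturing group; re.split() then keeps the separators.
with_group = re.split(r'(\s+|:)', 'find a str:s2')
print(with_group)  # ['find', ' ', 'a', ' ', 'str', ':', 's2']
```

The | alternation already means "or" without parentheses. Adding a capturing group only changes whether the separators are returned.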
Upvotes: -4