Reputation: 48416
How can I split by word boundary in a regex engine that doesn't support it?
Python's re module can match on \b but doesn't seem to support splitting on it. I seem to recall dealing with other regex engines that had the same limitation.
example input:
"hello, foo"
expected output:
['hello', ', ', 'foo']
actual python output:
>>> re.compile(r'\b').split('hello, foo')
['hello, foo']
Upvotes: 5
Views: 764
Reputation: 17004
One can also use re.findall() for this:
>>> re.findall(r'.+?\b', 'hello, foo')
['hello', ', ', 'foo']
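One caveat, as far as I can tell: with the pattern above, a trailing run of non-word characters is not followed by a word boundary, so it gets dropped from the result. Here is a minimal sketch of an alternative, assuming what you want is alternating runs of word and non-word characters. The helper name split_runs is made up for illustration.

```python
import re

def split_runs(text):
    # Hypothetical helper: return alternating runs of word (\w+)
    # and non-word (\W+) characters, in the order they appear.
    return re.findall(r'\w+|\W+', text)

print(split_runs('hello, foo'))   # ['hello', ', ', 'foo']
print(split_runs('hello, foo!'))  # ['hello', ', ', 'foo', '!']
```

Since every character is either a word or a non-word character, nothing in the input is lost, which is not the case for r'.+?\b' on 'hello, foo!'.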
Upvotes: 2
Reputation: 41142
Interesting. So far most RE engines I tried do this split.
I played a bit and found that re.compile(r'(\W+)').split('hello, foo')
is giving the output you expected... Not sure if that's reliable, though.
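For what it's worth, newer Python versions behave like those other engines: since Python 3.7, re.split() also splits on empty matches, so splitting on \b works directly. A small sketch, assuming Python 3.7 or later (the helper name split_at_boundaries is hypothetical):

```python
import re

def split_at_boundaries(text):
    # Hypothetical helper; relies on Python 3.7+, where re.split()
    # accepts patterns that can match the empty string.
    # The boundaries at the very start and end of the text produce
    # empty strings, so those are dropped here.
    return [part for part in re.split(r'\b', text) if part]

print(split_at_boundaries('hello, foo'))  # ['hello', ', ', 'foo']
```

On older versions the call returns the string unsplit, as shown in the question.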
Upvotes: 0
Reputation: 48416
Ok I figured it out:
Put the split pattern in capturing parentheses and the matched separators will be included in the output. You can use either \w+ or \W+:
>>> re.compile(r'(\w+)').split('hello, foo')
['', 'hello', ', ', 'foo', '']
To get rid of the empty results, pass the list through filter() with None as the filter function, which drops anything that doesn't evaluate to true (in Python 3, filter() returns an iterator, so wrap it in list()):
>>> list(filter(None, re.compile(r'(\w+)').split('hello, foo')))
['hello', ', ', 'foo']
Edit: CMS points out that if you use \W+ you don't need to use filter(), at least as long as the string starts and ends with word characters, as in this example.
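Putting the two pieces together, here is a minimal sketch that also copes with input that starts or ends with punctuation. The function name split_keep_separators is just something I made up for illustration.

```python
import re

def split_keep_separators(text):
    # Hypothetical helper. The capturing group makes re.split()
    # keep the non-word separators in the result.
    parts = re.split(r'(\W+)', text)
    # Empty strings show up when the text starts or ends with a
    # separator; drop them.
    return [part for part in parts if part]

print(split_keep_separators('hello, foo'))  # ['hello', ', ', 'foo']
print(split_keep_separators(', hello!'))    # [', ', 'hello', '!']
```

The list comprehension does the same job as filter(None, ...) and returns a list on both Python 2 and Python 3.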
Upvotes: 1
Reputation: 827476
(\W+) can give you the expected output:
>>> re.compile(r'(\W+)').split('hello, foo')
['hello', ', ', 'foo']
Upvotes: 11
Reputation: 78548
Try
>>> re.compile(r'\W\b').split('hello, foo')
['hello,', 'foo']
This splits at the non-word character before a boundary. Your example has nothing to split on.
Upvotes: 0