ʞɔıu

Reputation: 48416

Split by \b when your regex engine doesn't support it

How can I split by word boundary in a regex engine that doesn't support it?

Python's re module can match on \b but doesn't seem to support splitting on it. I seem to recall other regex engines with the same limitation.

example input:

"hello, foo"

expected output:

['hello', ', ', 'foo']

actual python output:

>>> re.compile(r'\b').split('hello, foo')
['hello, foo']
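For comparison, here is a minimal sketch of the same call on a newer interpreter. It assumes Python 3.7 or later, where re.split() also splits on empty matches; the output above is what older versions produce. The names parts and tokens are only for illustration.

```python
import re

# Assumes Python 3.7+, where re.split() also splits on empty (zero-width) matches.
parts = re.split(r'\b', 'hello, foo')
print(parts)   # ['', 'hello', ', ', 'foo', '']

# The boundaries at the very start and end of the string leave empty strings
# behind, so drop them to get the expected output.
tokens = [part for part in parts if part]
print(tokens)  # ['hello', ', ', 'foo']
```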

Upvotes: 5

Views: 764

Answers (5)

PEZ

Reputation: 17004

One can also use re.findall() for this:

>>> re.findall(r'.+?\b', 'hello, foo')
['hello', ', ', 'foo']
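One caveat with this pattern: text after the last word boundary is dropped, so trailing punctuation disappears. A possible alternative, sketched here with a hypothetical tokenize() helper, is to match runs of word and non-word characters directly, which does not rely on \b at all.

```python
import re

def tokenize(text):
    """Return alternating runs of word and non-word characters."""
    return re.findall(r'\w+|\W+', text)

print(tokenize('hello, foo'))    # ['hello', ', ', 'foo']
print(tokenize('hello, foo!'))   # ['hello', ', ', 'foo', '!']

# The \b-based pattern loses the trailing '!' because no boundary follows it.
print(re.findall(r'.+?\b', 'hello, foo!'))  # ['hello', ', ', 'foo']
```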

Upvotes: 2

PhiLho

Reputation: 41142

Interesting. So far most RE engines I tried do this split.

I played with it a bit and found that re.compile(r'(\W+)').split('hello, foo') gives the output you expected... Not sure if that's reliable, though.

Upvotes: 0

ʞɔıu

Reputation: 48416

Ok I figured it out:

Put the split pattern in capturing parens and the matched text will be included in the output. You can use either \w+ or \W+:

>>> re.compile(r'(\w+)').split('hello, foo')
['', 'hello', ', ', 'foo', '']

To get rid of the empty results, pass it through filter() with None as the filter function, which drops anything that doesn't evaluate to true. On Python 3, filter() returns an iterator, so wrap it in list():

>>> list(filter(None, re.compile(r'(\w+)').split('hello, foo')))
['hello', ', ', 'foo']

Edit: CMS points out that if you use \W+ you don't need to use filter() for this input, since the string starts and ends with word characters.
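Wrapped up as a function, the idea might look like the sketch below. The name split_keep_separators and its pattern parameter are made up for illustration and are not part of the re module.

```python
import re

def split_keep_separators(text, pattern=r'(\w+)'):
    """Split text on pattern, keeping the matched pieces in the result."""
    # The capturing group makes re.split() return the matched text too.
    pieces = re.split(pattern, text)
    # Drop the empty strings that appear when a match touches either end.
    return [piece for piece in pieces if piece]

print(split_keep_separators('hello, foo'))  # ['hello', ', ', 'foo']
```

Filtering with a list comprehension gives the same result on Python 2 and Python 3.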

Upvotes: 1

Christian C. Salvadó

Reputation: 827476

(\W+) can give you the expected output:

>>> re.compile(r'(\W+)').split('hello, foo')
['hello', ', ', 'foo']
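This gives a clean result here because the input begins and ends with word characters. A quick check with a made-up second input shows what happens otherwise:

```python
import re

splitter = re.compile(r'(\W+)')

print(splitter.split('hello, foo'))
# ['hello', ', ', 'foo']

# Punctuation at either end leaves an empty string on that side.
quoted = splitter.split('"hello, foo!"')
print(quoted)
# ['', '"', 'hello', ', ', 'foo', '!"', '']
```

If inputs like the second one are possible, the empty strings still need to be filtered out.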

Upvotes: 11

gnud

Reputation: 78548

Try

>>> re.compile(r'\W\b').split('hello, foo')
['hello,', 'foo']

This splits at the non-word character before a boundary. Your example has nothing to split on, since \b by itself matches only an empty string.

Upvotes: 0
