Reputation: 11

Using re.findall to split a string into words in Python

I'm using re.findall like this:

x = re.findall(r'\w+', text)

So I'm getting a list of words matching the characters [a-zA-Z0-9]. The problem is when I use this input:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~:

I want to get an empty list, but I'm getting ['_', '_']. How can I exclude those underscores?

Upvotes: 1

Answers (2)

John La Rooy

Reputation: 304375

You can also use itertools.groupby for this:

from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]

For example:

>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
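
Applied to the punctuation-only input from the question, this approach should also return an empty list, because str.isalnum is False for the underscore. A small sketch; the helper name split_words and the second sample string are made up for illustration:

```python
from itertools import groupby

def split_words(text):
    # Group consecutive characters by whether they are alphanumeric,
    # then keep only the alphanumeric runs (k is True for those groups).
    return [''.join(g) for k, g in groupby(text, str.isalnum) if k]

# The punctuation run from the question, repeated twice plus a trailing colon.
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
print(split_words(punctuation * 2 + ':'))   # []

# Underscores act as separators here, just like other punctuation.
print(split_words('snake_case and spam'))   # ['snake', 'case', 'and', 'spam']
```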

Upvotes: 0

Martijn Pieters

Reputation: 1123620

Use just the [a-zA-Z0-9] pattern; \w includes underscores:

x = re.findall(r'[a-zA-Z0-9]+', text)

Or use \W, the inverse of \w, inside a negated character set with _ added:

x = re.findall(r'[^\W_]+', text)

The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.

Demo:

>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
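
To illustrate the point about \w matching a wider range of characters: in Python 3, str patterns are Unicode-aware by default, so the two patterns can give different results on non-ASCII text. A minimal sketch, with a sample string made up for illustration:

```python
import re

# Hypothetical sample: "naïve café_au_lait 123", written with escapes
# so the accented letters are unambiguous.
text = 'na\u00efve caf\u00e9_au_lait 123'

# ASCII-only class: accented letters are not matched, so words get split.
ascii_words = re.findall(r'[a-zA-Z0-9]+', text)

# Negated class: anything that is a word character (\w) except the underscore.
unicode_words = re.findall(r'[^\W_]+', text)

print(ascii_words)    # ['na', 've', 'caf', 'au', 'lait', '123']
print(unicode_words)  # ['naïve', 'café', 'au', 'lait', '123']
```

Which result is "correct" depends on whether accented letters should count as part of a word in your data.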

Upvotes: 3
