Reputation: 11
I'm using re.findall like this:
x=re.findall('\w+', text)
so I'm getting a list of words matching the characters [a-zA-Z0-9].
The problem appears when I use this input:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~:
I want to get an empty list, but I'm getting ['_', '_']. How can I exclude those underscores?
Upvotes: 1
Views: 52
Reputation: 304375
You can use itertools.groupby for this too:
from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
For example:
>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
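As a quick check against the input from the question, here is a minimal sketch of the same groupby idea wrapped in a function. The name alnum_runs is hypothetical, and string.punctuation is used only because it happens to contain the same characters as the question's input.

```python
from itertools import groupby
import string


def alnum_runs(text):
    """Return the maximal runs of alphanumeric characters in text."""
    # groupby yields (key, group) pairs; the key is True for runs of
    # alphanumeric characters and False for everything else.
    return [''.join(group)
            for is_alnum, group in groupby(text, str.isalnum)
            if is_alnum]


# str.isalnum('_') is False, so underscores never start a word.
print(alnum_runs(string.punctuation * 2 + ':'))  # []
print(alnum_runs('foo_bar baz42'))               # ['foo', 'bar', 'baz42']
```

Note that str.isalnum also accepts non-ASCII letters and digits, so this is closer to the [^\W_] pattern than to [a-zA-Z0-9].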
Upvotes: 0
Reputation: 1123620
Use just the [a-zA-Z0-9] pattern; \w includes underscores:
x = re.findall('[a-zA-Z0-9]+', text)
or use the inverse of \w, \W, in a negated character class with _ added:
x = re.findall(r'[^\W_]+', text)
The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.
Demo:
>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
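To see where the two patterns differ, here is a small sketch comparing them on a made-up sample string. It assumes Python 3, where str patterns are Unicode-aware by default; the sample text and variable names are illustrative only.

```python
import re

# Hypothetical sample: an underscore, a non-ASCII letter (\u00e9 is "é")
# and some digits.
sample = 'snake_case caf\u00e9 42'

with_w = re.findall(r'\w+', sample)               # \w includes "_"
ascii_only = re.findall(r'[a-zA-Z0-9]+', sample)  # drops "_" and "é"
no_underscore = re.findall(r'[^\W_]+', sample)    # drops only "_"

print(with_w)         # ['snake_case', 'café', '42']
print(ascii_only)     # ['snake', 'case', 'caf', '42']
print(no_underscore)  # ['snake', 'case', 'café', '42']
```

If the input is guaranteed to be ASCII, both patterns give the same result; the difference only shows up with characters outside that range.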
Upvotes: 3