c0nman

Reputation: 309

How do I use Python's re module to split a string of text into words only?

Here's what I'm working with…

import re

string1 = "Dog,cat,mouse,bird. Human."

def string_count(text):
    # Split on runs of one or more non-word characters.
    text = re.split(r'\W+', text)
    count = 0
    for x in text:
        count += 1
        print(count)
        print(x)

    return text

print(string_count(string1))

…and here's the output…

1
Dog
2
cat
3
mouse
4
bird
5
Human
6

['Dog', 'cat', 'mouse', 'bird', 'Human', '']

Why am I getting a 6 even though there are only 5 words? I can't seem to get rid of the '' (empty string)! It's driving me insane.

Upvotes: 1

Views: 88

Answers (2)

Adam Smith

Reputation: 54173

Avinash Raj correctly stated WHY it's doing that. Here's how to fix it:

import re

string1 = "Dog,cat,mouse,bird. Human."
the_list = [word for word in re.split(r'\W+', string1) if word]
# include the word in the list if it's not the empty string

Alternatively (and this is better, because it matches the words directly, so there is no empty string to filter out):

string1 = "Dog,cat,mouse,bird. Human."
the_list = re.findall(r'\w+', string1)
# find all words in string1
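
To check that the two approaches agree on the question's input, here is a minimal sketch. The name count_words is a hypothetical helper for illustration only; it is not part of the re module or of the question's code.

```python
import re

string1 = "Dog,cat,mouse,bird. Human."

# Approach 1: split on runs of non-word characters, then drop empty strings.
filtered = [word for word in re.split(r'\W+', string1) if word]

# Approach 2: match the runs of word characters directly.
found = re.findall(r'\w+', string1)

def count_words(text):
    # Hypothetical helper: count the \w+ runs in text.
    return len(re.findall(r'\w+', text))

print(filtered == found)     # True
print(count_words(string1))  # 5
```

If all you need is the count, len() on either list does the job and the manual counter loop isn't needed.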

Upvotes: 1

Avinash Raj

Reputation: 174696

Because when it splits on the final dot, it also returns the empty part that follows it.

You split the input string on \W+, which matches one or more non-word characters. The regex therefore matches the final dot too and splits there. Since there is nothing after that dot, the split returns an empty string as the last element.
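
To see the behaviour in isolation, here is a small sketch. The short sample strings and the variable names are made up for illustration and are not from the question.

```python
import re

# A trailing non-word character leaves an empty string at the end.
trailing = re.split(r'\W+', "Dog,cat.")
print(trailing)      # ['Dog', 'cat', '']

# With no trailing delimiter there is no empty string.
no_trailing = re.split(r'\W+', "Dog,cat")
print(no_trailing)   # ['Dog', 'cat']

# A leading delimiter does the same thing at the front.
leading = re.split(r'\W+', ".Dog,cat")
print(leading)       # ['', 'Dog', 'cat']
```

So the empty string appears whenever the pattern matches at the very start or very end of the input.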

Upvotes: 1
