Roman Nazarkin

Reputation: 2229

How to extract words from a text using Python?

I need to extract the words and phrases within a text. For example, the text is:

Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456

The script should return the following:

Привет
как
дела
еще
одно
русское
слово
слово-1224

That is, I need to pull out of the text all the words that begin with a Russian letter ([а-яА-Яё-]) and that may contain digits and letters of the Russian alphabet. How can this be implemented?

Upvotes: 0

Views: 1377

Answers (1)

Niclas Nilsson

Reputation: 5901

It was a little trickier than I thought, since I have never worked with Cyrillic characters. I believe this should do it:

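import re

# Python's built-in re module does not support \p{IsCyrillic}, so the ranges are spelled out.
text = 'Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456'  # Set your input string here.
words = re.findall(r'[а-яА-ЯёЁ][а-яА-ЯёЁ0-9-]*', text)

for word in words:
    print(word)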
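
For reference, here is a minimal sketch of the same idea wrapped in a function, assuming Python 3 and only the standard `re` module. The names `extract_russian_words` and `CYRILLIC_WORD` are made up for illustration. The `(?<!\w)` lookbehind is an added assumption about what "begins with a Russian letter" should mean: it skips Cyrillic runs that are glued to a preceding Latin letter or digit.

```python
import re

# Hypothetical pattern name. A match must start with a Russian letter that is
# not preceded by another word character, and may continue with Russian
# letters, digits, or hyphens.
CYRILLIC_WORD = re.compile(r'(?<!\w)[а-яА-ЯёЁ][а-яА-ЯёЁ0-9-]*')


def extract_russian_words(text):
    """Return every token in text that begins with a Russian letter."""
    return CYRILLIC_WORD.findall(text)


if __name__ == '__main__':
    sample = ('Привет, hello, как дела? english word, '
              'еще одно русское слово, слово-1224, тест 4456')
    for word in extract_russian_words(sample):
        print(word)
```

On the sample text from the question this also returns тест, which the expected list in the question leaves out, so filter further if that omission was intentional.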

Upvotes: 1
