Reputation: 2229
I need to extract certain words and phrases from a text. For example, given the text:
Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456
The script should return the following:
Привет
как
дела
еще
одно
русское
слово
слово-1224
That is, I need to extract from the text all the words that begin with a Russian letter ([а-яА-Яё-]) and may contain digits and letters of the Russian alphabet. How can this be implemented?
Upvotes: 0
Views: 1377
Reputation: 5901
It was a little bit trickier than I thought. I have never used Cyrillic characters before. I believe this should do:
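import re

# Set your input Unicode string here (Python 3 source assumed).
text = 'Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456'
# Python's re module does not support \p{IsCyrillic}, so list the Russian letters explicitly.
words = re.findall('[а-яА-ЯёЁ][0-9а-яА-ЯёЁ-]*', text)
for word in words:
    print(word)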
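For reference, here is a minimal sketch of the same idea as a reusable Python 3 function, using only the standard re module. The names RUSSIAN_WORD, SAMPLE and extract_russian_words are hypothetical, chosen for illustration. The sketch assumes that "Russian letters" means а-я, А-Я, ё and Ё, and that a hyphen is allowed inside a word (as in слово-1224). The lookbehind is an extra assumption: it keeps a match from starting in the middle of a longer token such as abcслово.

```python
import re

# Hypothetical pattern: a Russian letter, then any mix of digits,
# Russian letters and hyphens. The negative lookbehind rejects matches
# that would start right after a letter, digit, underscore or hyphen.
RUSSIAN_WORD = re.compile(r'(?<![\w-])[а-яА-ЯёЁ][0-9а-яА-ЯёЁ-]*')

# Sample text taken from the question.
SAMPLE = ('Привет, hello, как дела? english word, еще одно русское слово, '
          'слово-1224, тест 4456')


def extract_russian_words(text):
    """Return all tokens in text that start with a Russian letter."""
    return RUSSIAN_WORD.findall(text)


if __name__ == '__main__':
    for word in extract_russian_words(SAMPLE):
        print(word)
```

On the sample text from the question this also returns тест, which the expected list in the question leaves out; it does start with a Russian letter, so it satisfies the rule as stated.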
Upvotes: 1