dot

Reputation: 87

Python - Regex Cyrillic mixed with Latin

I'm trying to extract the Cyrillic letters from a mixed input, but I can't get it to split the way I want. No numbers or special characters are involved.

input = "я я я я я w w w w w w\nф ф ф ф ф v v v v v v"
output = re.split("![а-я]\s*", input)
print(output)

I want to get rid of the w and v letters and just print the Russian ones. With my code, input and output are the same except that they're in a list now.

Upvotes: 2

Views: 1904

Answers (1)

Wiktor Stribiżew

Reputation: 627100

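To get all the Russian letters from your string, use the regex (?i)[А-ЯЁ] with re.findall. Do not forget Ё: the [А-Я] range does not include it. Your re.split call changes nothing because ! is a literal exclamation mark in a Python regex, not a negation, so the pattern never matches and the string comes back whole.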

Tested in Python 3:

>>> import re
>>> input = "я я я я я w w w w w w\nф ф ф ф ф v v v v v v"
>>> output = re.findall(r'(?i)[А-ЯЁ]', input)
>>> print(output)
['я', 'я', 'я', 'я', 'я', 'ф', 'ф', 'ф', 'ф', 'ф']
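
As a rough illustration of why Ё has to be listed separately (a sketch, not part of the tested session above): Ё and ё have code points outside the contiguous А-Я and а-я blocks, so a plain range skips them. The sample string in text is made up for the demonstration.

```python
import re

# А-Я occupy U+0410..U+042F and а-я occupy U+0430..U+044F,
# but Ё is U+0401 and ё is U+0451, so neither range covers them.
assert not (ord('А') <= ord('Ё') <= ord('Я'))
assert not (ord('а') <= ord('ё') <= ord('я'))

text = "ёж Ёлка w v"  # hypothetical sample input

without_yo = re.findall(r'(?i)[А-Я]', text)
with_yo = re.findall(r'(?i)[А-ЯЁ]', text)

print(without_yo)  # ['ж', 'л', 'к', 'а']
print(with_yo)     # ['ё', 'ж', 'Ё', 'л', 'к', 'а']
```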

To also extract Ukrainian letters, add ЇІЄҐ to the character class:

 output = re.findall(r"(?i)[А-ЯЁЇІЄҐ]", input)

An apostrophe is also considered a Ukrainian letter; I don't know whether you want to include it in the pattern.
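
If the apostrophe should be kept, one possible approach is to match whole words and allow an apostrophe only between letters, so that stray quotes around other text are not picked up. This is only a sketch: the set of apostrophe characters (', ʼ, ’), the names LETTERS, APOSTROPHES and word_re, and the sample text are all assumptions for illustration, not something from the question.

```python
import re

# Letters: the Russian range plus Ё and the Ukrainian ЇІЄҐ (matched case-insensitively).
LETTERS = "А-ЯЁЇІЄҐ"
# Assumed apostrophe variants: ASCII ', modifier letter ʼ (U+02BC), right quote ’ (U+2019).
APOSTROPHES = "'ʼ’"

# One or more letters, optionally followed by apostrophe + letters groups.
word_re = re.compile(rf"(?i)[{LETTERS}]+(?:[{APOSTROPHES}][{LETTERS}]+)*")

text = "м'ята mint з’їзд congress 'quoted'"  # hypothetical sample input
words = word_re.findall(text)
print(words)  # ["м'ята", 'з’їзд']
```

Because the apostrophe is accepted only when letters follow it, the quotes around 'quoted' are ignored, while an apostrophe inside a Cyrillic word stays attached to that word.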

Upvotes: 2
