Reputation: 96
I'd like to remove from a string any words that do not contain at least one Cyrillic letter (by words I mean the parts of the string separated by whitespace).
I've tried line = re.sub(' *^[^а-яА-Я]+ *', ' ', line)
where [а-яА-Я]
is the set of Cyrillic letters, but when processing the string
des поместья, de la famille Buonaparte. Non, je vous préviens que si vous
it returns
поместья, de la famille Buonaparte. Non, je vous préviens que si vous
instead of just
поместья
Upvotes: 0
Views: 991
Reputation: 627100
You want to keep any non-whitespace chunks that contain at least one Cyrillic char in them.
You can str.split()
the string and use unicodedata
to check if at least one char is Cyrillic, and only keep those "words":
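import unicodedata as ud
text = "des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
# keep only the whitespace-separated chunks that contain at least one Cyrillic char
result = ' '.join([word for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)])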
print(result) # => поместья,
If you also need to strip any punctuation use any of the solutions from Best way to strip punctuation from a string:
import string
result = ' '.join([word.translate(str.maketrans('', '', string.punctuation)) for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)])
print(result) # => поместья
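If you want this in reusable form, here is a minimal sketch of the same approach wrapped in functions. The names has_cyrillic, keep_cyrillic_words and strip_punct are only illustrative, not part of any library. One deliberate difference from the one-liners above: it calls ud.name(c, '') with a default, because ud.name(c) raises ValueError for characters that have no Unicode name (control characters, for example).

```python
import string
import unicodedata as ud

def has_cyrillic(word):
    # ud.name(c, '') returns '' instead of raising ValueError for unnamed chars
    return any('CYRILLIC' in ud.name(c, '') for c in word)

def keep_cyrillic_words(text, strip_punct=False):
    # split on whitespace and keep only chunks with at least one Cyrillic char
    words = [w for w in text.split() if has_cyrillic(w)]
    if strip_punct:
        # remove ASCII punctuation from the kept words only
        table = str.maketrans('', '', string.punctuation)
        words = [w.translate(table) for w in words]
    return ' '.join(words)

sample = "des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
print(keep_cyrillic_words(sample))                    # => поместья,
print(keep_cyrillic_words(sample, strip_punct=True))  # => поместья
```

Note that string.punctuation only covers ASCII punctuation, so characters such as « » or the ellipsis character would survive the stripping step.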
See the Python demo online. Details:
[word.translate(str.maketrans('', '', string.punctuation)) for word in text.split() if any('CYRILLIC' in ud.name(c) for c in word)] - a list comprehension that does the following:
text.split() - splits the text into non-whitespace chunks
if any('CYRILLIC' in ud.name(c) for c in word) - a condition checking if the word contains at least one Cyrillic char
word.translate(str.maketrans('', '', string.punctuation)) - takes the word if the condition above is True and strips punctuation from it
' '.join(...) - joins the list items into a single space-separated string.
Upvotes: 0
Reputation: 163477
One option is to match one or more characters that are neither in the ranges а-яА-Я nor whitespace, using [^а-яА-Я\s]+
The negative lookarounds (?<!\S)
and (?!\S)
assert whitespace boundaries to the left and to the right.
When replacing with an empty string, double-spaced gaps can remain, which you would then have to replace with a single space.
If you don't want to keep the trailing comma in the result, you can use strip and pass the characters that you want to remove.
See a regex demo for the matches.
For example:
import re
s = " des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
pattern = r"(?<!\S)[^а-яА-Я\s]+(?!\S)"
print(re.sub(pattern, "", s).strip(', '))
Output
поместья
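If several Cyrillic words are kept, the gaps left between them also need collapsing. A minimal sketch of that cleanup, assuming it is acceptable to normalise all whitespace runs to a single space (the names NON_CYRILLIC_WORD and keep_cyrillic_words are only illustrative). It also adds ёЁ to the character class, since those two letters lie outside the а-я and А-Я ranges and a chunk such as a standalone "ё" would otherwise be removed:

```python
import re

# non-whitespace chunks that contain no Cyrillic letter at all;
# ё and Ё are listed explicitly because they are outside а-я / А-Я
NON_CYRILLIC_WORD = re.compile(r"(?<!\S)[^а-яА-ЯёЁ\s]+(?!\S)")

def keep_cyrillic_words(s):
    # drop the non-Cyrillic chunks, leaving runs of spaces behind
    cleaned = NON_CYRILLIC_WORD.sub("", s)
    # split() + join collapses those runs and trims both ends
    return " ".join(cleaned.split())

s = " des поместья, de la famille Buonaparte. Non, je vous préviens que si vous"
print(keep_cyrillic_words(s))               # => поместья,
print(keep_cyrillic_words(s).strip(", "))   # => поместья
```

The strip call only removes characters at the ends of the string, so punctuation attached to words in the middle stays in place.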
Upvotes: 1