Reputation: 564
Given a string containing a mixture of Arabic and English, I want to remove every English character or word from it, leaving only the Arabic sentence. The following code doesn't work. How can I fix it?
import string
text = 'انا أحاول أن أعرف من انت this is not'
maintext = ''.join(ch for ch in text if ch not in set(string.punctuation))
text = filter(lambda x: x==' ' or x not in string.printable , maintext)
print(text)
Thank you
Upvotes: 1
Views: 2869
Reputation: 11968
The other answers suggest using a regex, but you can do this without one, using just the ASCII letters from the string module:
import string
text = 'انا أحاول أن أعرف من انت this is not'
text = "".join([char for char in text if char not in string.ascii_letters]).strip()
print(text)
OUTPUT
انا أحاول أن أعرف من انت
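If an English word sits between two Arabic words, the character filter above leaves the spaces on both sides of it, so a double space remains. A minimal sketch of a word-level variant, still without regex, is shown below. The helper name drop_english_words is hypothetical, and it assumes that any whitespace-separated token containing at least one ASCII letter should be dropped entirely:

```python
import string

ASCII_LETTERS = set(string.ascii_letters)

def drop_english_words(text):
    # Keep only the tokens that contain no ASCII letter at all
    # (assumption: a mixed token counts as English and is dropped).
    kept = [word for word in text.split() if not ASCII_LETTERS.intersection(word)]
    # Re-join with single spaces, which also normalizes the spacing.
    return ' '.join(kept)

print(drop_english_words('انا أحاول أن أعرف من انت this is not'))
```

Because the tokens are re-joined with single spaces, no strip() call is needed afterwards.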
Upvotes: 0
Reputation: 240
Here is my version:
import re
text = 'انا أحاول أن أعرف من انت this is not'
# Delete every ASCII letter, then trim the spaces left where the English words were
maintext = re.sub(r'[a-zA-Z]', '', text).strip()
print(maintext)
Upvotes: 0
Reputation: 520928
You could try using re.sub here:
# -*- coding: utf-8 -*-
import re
text = 'انا أحاول أن أعرف من انت this is not'
output = re.sub(r'\s*[A-Za-z]+\b', '' , text)
output = output.rstrip()
print(output)
This prints:
انا أحاول أن أعرف من انت
As a side note, the regex pattern \s*[A-Za-z]+ matches any whitespace before each English word but not the whitespace after it, so two Arabic words that surrounded an English word stay separated by a single space instead of being fused together. This can still leave trailing whitespace on the right-hand side, so we call rstrip() to remove it.
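The same idea can be wrapped in a small function that also handles English at the start of the string, where the pattern leaves a leading space that rstrip() does not remove. This is only a sketch: the name strip_english is hypothetical, and it assumes "English" means runs of the ASCII letters A-Z and a-z:

```python
import re

def strip_english(text):
    # Remove each run of ASCII letters together with the whitespace before it.
    without_english = re.sub(r'\s*[A-Za-z]+\b', '', text)
    # Collapse any leftover whitespace runs and trim both ends.
    return ' '.join(without_english.split())

print(strip_english('انا أحاول hello أن أعرف'))
print(strip_english('this is not انا أحاول'))
```

Splitting and re-joining is used here instead of rstrip() so that leading, trailing, and repeated spaces are all normalized in one step.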
Upvotes: 1