Dave

Reputation: 564

Remove English words from Arabic string

Given a string containing a mixture of Arabic and English, I want to remove every English character or word from it, leaving only the Arabic sentence. The following code doesn't work. How can I fix it?

import string

text = 'انا أحاول أن أعرف من انت this is not'
maintext = ''.join(ch for ch in text if ch not in set(string.punctuation))
text = filter(lambda x: x==' ' or x not in string.printable , maintext)
print(text)

Thank you

Upvotes: 1

Views: 2869

Answers (3)

Chris Doyle

Reputation: 11968

The other answers all suggest a regex, but you can also do this without one, using just ascii_letters from the string module:

import string

text = 'انا أحاول أن أعرف من انت this is not'
text = "".join([char for char in text if char not in string.ascii_letters]).strip()
print(text)

OUTPUT

انا أحاول أن أعرف من انت
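
If the English words can also appear in the middle of the sentence, removing letters one by one leaves double spaces behind. A minimal sketch of one way to handle that without regex, assuming it is acceptable to collapse every run of whitespace into a single space (the helper name strip_ascii_letters and the extra sample sentence are my own, not from the question):

```python
import string

# Translation table that maps every ASCII letter to None, i.e. deletes it.
_DELETE_ASCII_LETTERS = str.maketrans('', '', string.ascii_letters)


def strip_ascii_letters(text):
    """Delete ASCII letters, then collapse the leftover runs of whitespace."""
    without_letters = text.translate(_DELETE_ASCII_LETTERS)
    # split() with no arguments splits on any whitespace run and drops
    # empty pieces, so joining with ' ' leaves single spaces and no padding.
    return ' '.join(without_letters.split())


print(strip_ascii_letters('انا أحاول hello أن أعرف من انت this is not'))
```

Like the version above, this only touches the letters a-z and A-Z, so digits and punctuation are left in place.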

Upvotes: 0

Shakir Zareen

Reputation: 240

Here is my version:

import re

text = 'انا أحاول أن أعرف من انت this is not'
maintext = re.sub(r'[a-zA-Z]', '', text)
print(maintext)

Upvotes: 0

Tim Biegeleisen

Reputation: 520928

You could try using re.sub here:

# -*- coding: utf-8 -*-
import re

text = 'انا أحاول أن أعرف من انت this is not'
output = re.sub(r'\s*[A-Za-z]+\b', '', text)
output = output.rstrip()
print(output)

This prints:

انا أحاول أن أعرف من انت

As a side note, the pattern \s*[A-Za-z]+ also matches any whitespace in front of each English word. That way, when an English word sits between two Arabic words, only the space before it is removed and the space after it remains, so the Arabic words are neither fused together nor left with a double space between them. This still leaves the possibility of trailing whitespace at the end of the string, so we call rstrip() to remove it.
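
To see the whitespace handling in action, here is a small sketch that wraps the same pattern in a function and tries it on English words in the middle and at the start of the string. The names ENGLISH_WORD and remove_english_words and the extra sample strings are mine, for illustration only. I also swapped rstrip() for strip() here, on the assumption that an English word may come first, in which case the space after it would otherwise be left at the front:

```python
import re

# Optional leading whitespace, then a run of ASCII letters that ends at a
# word boundary (same pattern as above, just precompiled).
ENGLISH_WORD = re.compile(r'\s*[A-Za-z]+\b')


def remove_english_words(text):
    """Remove English words together with the whitespace in front of them."""
    # strip() instead of rstrip(): a leading English word leaves the space
    # that followed it at the start of the result.
    return ENGLISH_WORD.sub('', text).strip()


for sample in (
    'انا أحاول أن أعرف من انت this is not',  # English at the end
    'انا أحاول hello أن أعرف',               # English in the middle
    'hello مرحبا',                           # English at the start
):
    print(remove_english_words(sample))
```

Note that because of the \b, a run of English letters glued directly to Arabic letters with no space in between is not matched, so such mixed tokens are left as they are.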

Upvotes: 1
