Mai

Reputation: 131

Arabic Dataset Cleaning: Removing everything but Arabic text

I have a huge dataset in Arabic. I cleaned the data of special characters and English characters, but then I discovered that the dataset contains many other languages, such as Chinese, Japanese, Russian, etc. The problem is that I can't tell exactly which other languages are mixed in with the Arabic, so I need a way to remove everything except Arabic characters from a pandas DataFrame. Here is my code:

import re

def clean_txt(input_str):
    try:
        if input_str:  # if the input string is not empty, do the following

            input_str = re.sub(r'[?؟!@#$%&*+~/=><^]+', '', input_str)  # remove some special chars
            input_str = re.sub(r'[a-zA-Z?]', '', input_str).strip()  # remove English chars
            input_str = re.sub(r'\s+', ' ', input_str)  # collapse runs of whitespace into one space
            input_str = input_str.replace('_', ' ')  # remove underscore
            input_str = input_str.replace('ـ', '')  # remove Arabic tatweel
            input_str = input_str.replace('"', '')  # remove "
            input_str = input_str.replace("''", '')  # remove ''
            input_str = input_str.replace("'", '')  # remove '
            input_str = input_str.replace('.', '')  # remove .
            input_str = input_str.replace(',', '')  # remove ,
            input_str = input_str.replace(':', ' ')  # replace : with a space
            input_str = re.sub(r' ?\([^)]+\)', '', input_str)  # remove text between ()
            input_str = input_str.strip()  # trim the input string

    except Exception:
        return input_str
    return input_str

Sample-data

Upvotes: 0

Views: 1072

Answers (3)

Mai

Reputation: 131

Finally, I found the answer:

text ='大谷育江 صباح الخيرfff :"""%#$@&!~(2009 مرحباً Добро пожаловать fffff أحمــــد ݓ'


t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufdf0-\ufdfd\ufe70-\ufefc]+', ' ', text)
t
' صباح الخير 2009 مرحباً أحمــــد ݓ'
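Since the question asks about a pandas DataFrame, here is a minimal sketch of applying this pattern to a column (the column name text and the sample rows are hypothetical):

```python
import re

import pandas as pd

# Keep only digits and characters from the Arabic Unicode blocks;
# every run of anything else becomes a single space.
ARABIC_ONLY = re.compile(
    r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f'
    r'\ufd50-\ufd8f\ufdf0-\ufdfd\ufe70-\ufefc]+'
)

def keep_arabic(text):
    return ARABIC_ONLY.sub(' ', str(text)).strip()

df = pd.DataFrame({'text': ['大谷育江 صباح الخير', 'Добро пожаловать مرحباً 2009']})
df['text'] = df['text'].apply(keep_arabic)
print(df['text'].tolist())  # ['صباح الخير', 'مرحباً 2009']
```

Using `apply` keeps the cleaning function testable on single strings before running it over the whole column.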

Upvotes: 1

Joop Eggen

Reputation: 109613

import regex  # third-party 'regex' module; the built-in re does not support \p{...}

input_str = regex.sub(r'[^ \p{Arabic}]', '', input_str)

Everything that is not a space and not Arabic script is removed. You might want to keep punctuation as well, and you would need to take care of leftovers such as empty parentheses (), but for that you could look into Unicode script and category names.


Corrected: instead of InArabic it should be Arabic; see Unicode scripts.
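If installing the third-party regex module is not an option, the same idea can be approximated with the standard library's re by spelling out the main Arabic Unicode blocks explicitly (a sketch; the block list below is an assumption and may not cover every Arabic character):

```python
import re

# Rough stand-in for \p{Arabic}: the Arabic, Arabic Supplement,
# Arabic Extended-A, and Arabic Presentation Forms A/B blocks.
NOT_ARABIC_OR_SPACE = re.compile(
    r'[^ \u0600-\u06ff\u0750-\u077f\u08a0-\u08ff'
    r'\ufb50-\ufdff\ufe70-\ufeff]'
)

def arabic_only(text):
    # Delete every character that is neither a space nor in an Arabic block.
    return NOT_ARABIC_OR_SPACE.sub('', text)

print(arabic_only('abc مرحبا 123 بكم'))
```

The third-party regex module remains the cleaner choice, since `\p{Arabic}` tracks the Unicode script property rather than a hand-maintained list of blocks.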

Upvotes: 0

J_H

Reputation: 20550

Language detection is a solved problem.

The simplest algorithmic approach is to scan a set of single-language texts for character bi-grams, then compute the distance between those frequencies and the bi-gram frequencies of the target text.
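That bi-gram approach can be sketched in a few lines (the reference texts below are toy examples; a real system would build profiles from much larger corpora):

```python
from collections import Counter

def bigram_freqs(text):
    # Relative frequencies of character bi-grams in the text.
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {bg: n / total for bg, n in pairs.items()}

def distance(a, b):
    # Sum of absolute frequency differences over the union of bi-grams.
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

# Toy single-language reference texts (far too small for real use).
profiles = {
    'eng': bigram_freqs('the quick brown fox jumps over the lazy dog'),
    'ara': bigram_freqs('صباح الخير مرحبا بكم في هذا المكان الجميل'),
}

target = 'مرحبا صباح النور'
guess = min(profiles, key=lambda lang: distance(profiles[lang], bigram_freqs(target)))
print(guess)  # ara
```

The target text shares many bi-grams with the Arabic profile and almost none with the English one, so the smallest distance picks out 'ara'.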

Simplest thing for you to implement is to call into this NLTK routine:

import nltk
from nltk.classify.textcat import TextCat

nltk.download(['crubadan', 'punkt'])
tc = TextCat()

>>> tc.guess_language('Now is the time for all good men to come to the aid of their party.')
'eng'
>>> tc.guess_language('Il est maintenant temps pour tous les hommes de bien de venir en aide à leur parti.')
'fra'
>>> tc.guess_language('لقد حان الوقت الآن لجميع الرجال الطيبين لمساعدة حزبهم.')
'arb'

Upvotes: -1
