Preprocessing Arabic text

Question

When I apply this code: re.sub(r'\sو(\w+)', r' و \1', text), it deletes the letter "و" at the front of the word. I just want to separate it from the word. For example, given " أكلت موزة وتفاحة في الحديقة ", I want the result to be " أكلت موزة و تفاحة في الحديقة ", but instead I get " أكلت موزة تفاحة في الحديقة ".

This is the code:

import re
import string


class Arabic_preprocessing:

    def __init__(self):

        # preparing punctuations list
        arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
        english_punctuations = string.punctuation
        self.all_punctuations = set(arabic_punctuations + english_punctuations)

        # initializing the stemmer
        #self.stemmer = ARLSTem()  # requires minimum NLTK version of 3.2.5

        # diacritic marks to strip; re.VERBOSE ignores the whitespace and comments
        self.arabic_diacritics = re.compile("""
                                     ّ    | # Tashdid
                                     َ    | # Fatha
                                     ً    | # Tanwin Fath
                                     ُ    | # Damma
                                     ٌ    | # Tanwin Damm
                                     ِ    | # Kasra
                                     ٍ    | # Tanwin Kasr
                                     ْ    | # Sukun
                                     ـ     # Tatwil/Kashida

                                 """, re.VERBOSE)

    def normalize_arabic(self, text):
        text = re.sub("[إأآاٱ]", "ا", text)  # unify alef variants
        text = re.sub("ى", "ي", text)  # replace alef maqsura with ya
        #text = re.sub("ؤ", "ء", text)
        #text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text)  # replace ta marbuta with ha
        text = re.sub("گ", "ك", text)  # replace gaf with kaf
        text = re.sub(r'\bال(\w\w+)', r'\1', text)  # remove the definite article "ال"
        text = re.sub(r'\sو(\w+)', r' و \1', text)  # separate a leading "و" from the word

        text = re.sub("\u0640", '', text)  # remove tatweel
        return text


Answers (1)
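
The following is a minimal sketch for narrowing the problem down, not a confirmed diagnosis. It assumes Python 3, where `\w` and `\s` match Unicode characters for `str` patterns. Run on its own, the substitution from the question keeps the "و" as a separate token, so the letter may be getting dropped by a later step that is not shown in the question. The helper names `separate_waw` and `drop_single_letter_tokens` are hypothetical, and the one-letter token filter is only an assumed example of such a later step.

```python
import re


def separate_waw(text):
    # The substitution from the question, applied in isolation.
    return re.sub(r'\sو(\w+)', r' و \1', text)


def drop_single_letter_tokens(text):
    # Hypothetical later step (NOT in the question): a filter that removes
    # one-character tokens would also remove the now-standalone "و".
    return " ".join(tok for tok in text.split() if len(tok) > 1)


text = " أكلت موزة وتفاحة في الحديقة "

separated = separate_waw(text)
print(separated.split())   # "و" is present as its own token

filtered = drop_single_letter_tokens(separated)
print(filtered.split())    # "و" is gone after the assumed filter
```

If the first print already lacks the "و" in your environment, the cause lies elsewhere (for example, the Python version or how the text is read in); if only the second does, checking any stop-word or short-token removal that runs after `normalize_arabic` would be a reasonable next step.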
