Reputation: 21
When I apply this code: re.sub(r'\sو(\w+)', r' و \1', text), it deletes the letter "و" located at the front of the word. I just want to separate it from the word. For example, given " أكلت موزة وتفاحة في الحديقة " I want to get " أكلت موزة و تفاحة في الحديقة", but instead I get " أكلت موزة تفاحة في الحديقة ".
This is the code:

import re
import string

class Arabic_preprocessing:
    def __init__(self):
        # preparing punctuations list
        arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
        english_punctuations = string.punctuation
        self.all_punctuations = set(arabic_punctuations + english_punctuations)
        # initializing the stemmer
        #self.stemmer = ARLSTem() # requires minimum NLTK version of 3.2.5
        self.arabic_diacritics = re.compile("""
ّ | # Tashdid
َ | # Fatha
ً | # Tanwin Fath
ُ | # Damma
ٌ | # Tanwin Damm
ِ | # Kasra
ٍ | # Tanwin Kasr
ْ | # Sukun
ـ # Tatwil/Kashida
""", re.VERBOSE)
    def normalize_arabic(self, text):
        text = re.sub("[إأآاٱ]", "ا", text)
        text = re.sub("ى", "ي", text)
        #text = re.sub("ؤ", "ء", text)
        #text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text) # replace ta2 marboota by ha2
        text = re.sub("گ", "ك", text)
        text = re.sub(r'\bال(\w\w+)', r'\1', text) # remove al ta3reef
        text = re.sub(r'\sو(\w+)', r' و \1', text) # separate a leading waw from the word
        text = re.sub("\u0640", '', text) # remove tatweel
        return text
Upvotes: 1
Views: 280
Reputation: 41
The problem is not with the regular expression.
I ran it in my Python interpreter and it works fine: the "و" is kept and separated from the word.
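A minimal way to check this in isolation is to run only that substitution on the sample sentence from the question. This is a sketch, and the helper name separate_waw is made up for illustration; it is not part of the posted class.

```python
import re

def separate_waw(text):
    # Hypothetical helper: same pattern as in the question.
    # \s matches the whitespace before the word, "و" is the leading waw,
    # and (\w+) captures the rest of the word so it can be re-emitted.
    return re.sub(r'\sو(\w+)', r' و \1', text)

sample = " أكلت موزة وتفاحة في الحديقة "
result = separate_waw(sample)

# repr() makes leading and trailing spaces visible.
print(repr(result))
```

If the letter still disappears when the whole pipeline runs, it may be getting removed by some other step that is not shown in the question; that is only a guess, since this method on its own keeps it. Note also that the pattern needs whitespace before the "و", so a word at the very start of the text is not matched, and any word that simply begins with "و" will be split as well.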
Upvotes: 0