Reputation: 35
I'm trying to apply preprocessing techniques to a list of Arabic strings, but I'm not getting the correct results.
This is my code:
import re
import sys
import itertools
from nltk.tokenize import TweetTokenizer
from nltk.stem.isri import ISRIStemmer
foo = 'السـلاام عــليكم 32 هذه تجّربة'
TATWEEL = u"\u0640"
stemmer = ISRIStemmer()
tknzr = TweetTokenizer()
text = tknzr.tokenize(foo)
for index in text:
    newList = [i for i in text if not i.isdigit()] # Remove digit
    newList = ' '.join([i.lower() for i in text if not i.startswith(('@', '#'))]) # Remove mentions and hashtags
    newList = re.sub(r"http\S+", "",index) # Remove links
    newList = stemmer.norm(index, num=1) # Remove diacritics
    newList = re.sub(r'[^\w\s]','', index) # Remove punctuation
    newList = index.replace(TATWEEL, '')
    newList = ''.join(i for i, _ in itertools.groupby(index)) # Remove consecutive duplicate
print (newList)
The list I should get is:
السلام عليكم هذه تجربة
but what I got is:
ربة
When I test each step on its own it works, but when I combine them the result is wrong.
I'm using Python 3.
Thank you.
Upvotes: 2
Views: 845
Reputation: 286
There is a package named Hazm for right-to-left languages. It adapts NLTK-style tools for such text, although it is designed primarily for Persian rather than Arabic. Here's the link: Hazm.
Upvotes: 0
Reputation: 87134
The value that you are seeing is the last item in the list text. All preceding items are lost because they are not being stored anywhere.
Furthermore, each operation in the body of the for loop assigns a value to newList, but newList is never referenced by the subsequent operations, so any cumulative effect is lost.
To solve the first problem, create a new empty list before the for loop and append each item to it as it is processed. This will be the final result list.
The second problem is solved by referencing index in each step and assigning the result back to index.
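As a minimal sketch of the two problems in isolation, here is a toy example on plain ASCII strings. The variable names and the sample words are made up for illustration and are not part of the original code.

```python
words = ["Hello!!", "42", "wooorld"]

# Buggy pattern: every step reads the untouched loop variable and
# overwrites new_value, so only the last step on the last word survives.
for w in words:
    new_value = w.lower()
    new_value = w.strip("!")
buggy = new_value

# Fixed pattern: feed each step the result of the previous step and
# collect every processed word in a list.
cleaned = []
for w in words:
    w = w.lower()
    w = w.strip("!")
    if not w.isdigit():
        cleaned.append(w)

print(buggy)    # wooorld
print(cleaned)  # ['hello', 'wooorld']
```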
Here is a solution:
import re
import sys
import itertools
from nltk.tokenize import TweetTokenizer
from nltk.stem.isri import ISRIStemmer
foo = 'السـلاام عــليكم 32 هذه تجّربة'
TATWEEL = u"\u0640"
stemmer = ISRIStemmer()
tknzr = TweetTokenizer()
text = tknzr.tokenize(foo)
result = []  # cleaned strings are stored here
for word in text:
    if word.startswith(('@', '#')):  # Skip mentions and hashtags
        continue
    word = word.lower()
    word = ''.join([i for i in word if not i.isdigit()])  # Remove digits
    word = re.sub(r"http\S+", "", word)  # Remove links
    word = stemmer.norm(word, num=1)  # Remove diacritics
    word = re.sub(r'[^\w\s]', '', word)  # Remove punctuation
    word = word.replace(TATWEEL, '')  # Remove tatweel
    word = ''.join(i for i, _ in itertools.groupby(word))  # Remove consecutive duplicates
    if word:
        result.append(word)
print(' '.join(result))
Output
السلام عليكم هذه تج ربة
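Note that the output above still splits the last word in two, most likely because the tokenizer treats the shadda as a separate token. One possible workaround, sketched below, is to strip diacritics and tatweel from the whole string before splitting it. This sketch swaps TweetTokenizer for a plain whitespace split and uses only the standard library. The helper name clean_arabic and the diacritic range U+064B to U+0652 are assumptions made for illustration.

```python
import itertools
import re

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"


def clean_arabic(text):
    """Hypothetical helper: normalise the string first, then split on whitespace."""
    text = re.sub(r"http\S+", "", text)   # remove links
    text = DIACRITICS.sub("", text)       # remove diacritics before splitting
    text = text.replace(TATWEEL, "")      # remove tatweel
    words = []
    for word in text.split():
        if word.startswith(("@", "#")):   # skip mentions and hashtags
            continue
        word = re.sub(r"[^\w\s]", "", word)                      # remove punctuation
        word = "".join(ch for ch in word if not ch.isdigit())    # remove digits
        word = "".join(ch for ch, _ in itertools.groupby(word))  # collapse repeated characters
        if word:
            words.append(word)
    return " ".join(words)


result = clean_arabic("السـلاام عــليكم 32 هذه تجّربة")
print(result)
```

On the sample string this should print the four words the question asks for, with the last word kept in one piece. Collapsing repeated characters also shortens words that legitimately contain a doubled letter, so that step may need to be relaxed for real data.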
Upvotes: 0