IS Student

Reputation: 35

Issue in Arabic preprocessing techniques

I'm trying to apply preprocessing techniques to a list of Arabic strings, but I'm not getting the correct results.

This is my code:

import re
import sys
import itertools
from nltk.tokenize import TweetTokenizer
from nltk.stem.isri import ISRIStemmer

foo = 'السـلاام عــليكم 32 هذه تجّربة'
TATWEEL = u"\u0640"
stemmer = ISRIStemmer()
tknzr = TweetTokenizer()
text = tknzr.tokenize(foo)

for index in text:
    newList = [i for i in text if not i.isdigit()] # Remove digit 
    newList = ' '.join([i.lower() for i in text if not i.startswith(('@', '#'))]) # Remove mentions and hashtags
    newList = re.sub(r"http\S+", "",index) # Remove links
    newList = stemmer.norm(index, num=1) # Remove diacritics
    newList = re.sub(r'[^\w\s]','', index)  # Remove punctuation
    newList = index.replace(TATWEEL, '')
    newList = ''.join(i for i, _ in itertools.groupby(index)) # Remove consecutive duplicate

print (newList)

The list I should get is:

السلام عليكم هذه تجربة

but what I got is:

ربة

When I test each step on its own it works, but when I combine them the result is wrong.

I'm using Python 3.

Thank you.

Upvotes: 2

Views: 845

Answers (2)

A_emperio

Reputation: 286

There is a package for right-to-left languages named Hazm. It adapts NLTK-style tools to such languages, although it primarily targets Persian, so check that it covers the Arabic steps you need. Here's the link: Hazm.

Upvotes: 0

mhawke

Reputation: 87134

The value you are seeing comes from the last item in the list text. All preceding items are lost because they are not stored anywhere.

Furthermore, each statement in the body of the for loop assigns a value to newList, but newList is never read by the statements that follow (each one starts again from index or text), so no cumulative effect is carried from one step to the next.
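The two problems can be reproduced without NLTK. The following is a minimal sketch using made-up ASCII tokens; the names tokens, cleaned, overwritten, and chained are hypothetical and only serve the illustration:

```python
import re

tokens = ["ab-", "c1", "d2!"]  # made-up sample tokens

# Mirrors the original loop: every step reads the raw loop variable,
# and every assignment overwrites the previous value of `cleaned`.
for token in tokens:
    cleaned = "".join(ch for ch in token if not ch.isdigit())  # result is discarded
    cleaned = re.sub(r"[^\w\s]", "", token)  # starts again from the raw token

overwritten = cleaned  # only the last step applied to the last token survives

# Chained version: each step feeds the next, and every token is kept.
chained = []
for token in tokens:
    token = "".join(ch for ch in token if not ch.isdigit())
    token = re.sub(r"[^\w\s]", "", token)
    if token:
        chained.append(token)

print(overwritten)  # d2
print(chained)      # ['ab', 'c', 'd']
```

In the first loop the digit in the last token is still present, because the digit-removal result was thrown away by the next assignment.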

To solve the first problem you can create a new empty list before the for loop to which items are appended as they are processed. This would be the final result list.

To solve the second problem, have each step read the loop variable and assign its result back to that same variable, so that the next step works on the already-cleaned value.

Here is a solution:

import re
import sys
import itertools
from nltk.tokenize import TweetTokenizer
from nltk.stem.isri import ISRIStemmer

foo = 'السـلاام عــليكم 32 هذه تجّربة'
TATWEEL = u"\u0640"
stemmer = ISRIStemmer()
tknzr = TweetTokenizer()
text = tknzr.tokenize(foo)

result = []     # cleaned strings are stored here

for word in text:
    if word.startswith(('@', '#')):    # Skip mentions and hashtags
        continue
    word = word.lower()
    word = ''.join([i for i in word if not i.isdigit()]) # Remove digits
    word = re.sub(r"http\S+", "", word) # Remove links
    word = stemmer.norm(word, num=1) # Remove diacritics
    word = re.sub(r'[^\w\s]', '', word) # Remove punctuation
    word = word.replace(TATWEEL, '') # Remove tatweel (elongation character)
    word = ''.join(i for i, _ in itertools.groupby(word)) # Collapse consecutive duplicate characters
    if word:
        result.append(word)

print(' '.join(result))

Output

السلام عليكم هذه تج ربة
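Note that the last word still comes out in two pieces, whereas the question expects a single word. A likely cause is that the tokenizer treats the shadda (a combining mark) as a token boundary, so the word is already split before the diacritics are removed. One way around this is to strip diacritics and tatweel before splitting into words. Below is a rough sketch using only the standard library; the clean_arabic function, the DIACRITICS range (U+064B to U+0652), and the plain whitespace split in place of TweetTokenizer are assumptions made for illustration, not part of the code above:

```python
import itertools
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # assumed range: fathatan through sukun
TATWEEL = "\u0640"

def clean_arabic(text):
    """Hypothetical helper: clean a string and return the words joined by spaces."""
    # Strip diacritics and tatweel first, so they cannot split a word later.
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    cleaned = []
    for word in text.split():  # plain whitespace split, not TweetTokenizer
        if word.startswith(("@", "#", "http")):  # skip mentions, hashtags, links
            continue
        word = "".join(ch for ch in word if not ch.isdigit())  # remove digits
        word = re.sub(r"[^\w\s]", "", word)  # remove punctuation
        word = "".join(ch for ch, _ in itertools.groupby(word))  # collapse repeats
        if word:
            cleaned.append(word)
    return " ".join(cleaned)

sample = 'السـلاام عــليكم 32 هذه تجّربة'
print(clean_arabic(sample))  # السلام عليكم هذه تجربة
```

Keep in mind that collapsing consecutive duplicates also shortens words that legitimately contain a doubled letter, so that step may need to be restricted or dropped depending on the data.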

Upvotes: 0
