Bashar

Reputation: 71

Remove symbols in dataset

I have applied all the preprocessing steps, but I want to delete the rows that contain English words or specific symbols. I only want to keep words in Arabic, without the symbols or English words targeted in the code below. I ran the code, but when I print the dataset after cleaning, those rows are still there. I want to remove them, not replace them.

import re
import pandas as pd
lexicon = pd.read_csv(r"C:\Users\User\Python Code\data.csv")
lexicon.head(10)

#output
    Vocabulary
0   [PAD]
1   [UNK]
2   [CLS]
3   [SEP]
4   [MASK]
5   !
6   #
7   $
8   %
9   &

lexicon['clean_tweet'] = lexicon.Vocabulary.str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True) #removes emojis
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'@[_A-Za-z0-9]+', '', regex=True) #removes handles
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[A-Za-z0-9]+', '', regex=True) #removes english
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('#',' ') #removes hashtag symbol only
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'http\S+', '', regex=True).str.replace(r'www\S+', '', regex=True) #removes links
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'\d+', '', regex=True) #removes numbers
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('\n', ' ') #removes new line
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('_', '') #removes underscore
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\w\s]', '', regex=True) #removes punctuation
lexicon.head(10)

# Vocabulary    clean_tweet
0   [PAD]   
1   [UNK]   
2   [CLS]   
3   [SEP]   
4   [MASK]  
5   !   
6   #   
7   $   
8   %   
9   &   

I want to remove every row that contains these symbols or words in any other language; I only need the Arabic words. Alternatively, is there a simpler way to detect Arabic words only?

Note: if a row contains both Arabic words and symbols, I only want to delete the symbols and keep the Arabic words.

Upvotes: 2

Views: 581

Answers (1)

Tim Biegeleisen

Reputation: 521989

Going by this SO answer, a Unicode regex range for Arabic letters is:

[\u0627-\u064a]
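
As a quick check of what this class covers, here is a minimal sketch using only the standard re module. The sample strings are illustrative assumptions, not taken from the question's data. The range starts at alef (U+0627), so the hamza-carrying letters U+0621 to U+0626 fall outside it; the wider range shown here is one possible variant if those letters matter, not part of the range quoted above.

```python
import re

# The class quoted above: alef (U+0627) through yeh (U+064A).
arabic_letter = re.compile(r'[\u0627-\u064a]')

# A wider variant (an assumption, not from the quoted range) that also
# covers the hamza forms U+0621..U+0626.
arabic_letter_wide = re.compile(r'[\u0621-\u064a]')

# Illustrative samples: an Arabic word, an English word, a special token,
# and alef with hamza above (U+0623) written as an escape for clarity.
samples = ['\u0643\u062a\u0627\u0628', 'hello', '[PAD]', '\u0623']

for s in samples:
    has_narrow = bool(arabic_letter.search(s))
    has_wide = bool(arabic_letter_wide.search(s))
    print(repr(s), has_narrow, has_wide)
```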

We can try using the negative version of this character class along with str.replace:

lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\u0627-\u064a]', '', regex=True)

If you want to spare whitespace characters or other punctuation symbols, then you could try using this regex:

[^\u0627-\u064a\s!?.-]
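
Note that str.replace only blanks out characters; it never drops rows. To also remove the rows that end up with no Arabic text, one option is to filter on the cleaned column afterwards. The following is a minimal sketch under stated assumptions: the small in-memory DataFrame is a hypothetical stand-in for data.csv, and the name arabic_only is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.csv.
lexicon = pd.DataFrame(
    {'Vocabulary': ['[PAD]', '!', 'كتاب', 'سلام!!', 'hello', 'abc قلم']}
)

# Remove every character that is neither in the Arabic range nor whitespace.
cleaned = lexicon['Vocabulary'].str.replace(
    r'[^\u0627-\u064a\s]', '', regex=True
)

# Collapse runs of whitespace and trim the ends, so rows that held only
# symbols or English text become empty strings.
cleaned = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

lexicon['clean_tweet'] = cleaned

# Keep only the rows where some Arabic text remains.
arabic_only = lexicon[lexicon['clean_tweet'] != ''].reset_index(drop=True)
print(arabic_only)
```

Rows that mix Arabic with symbols or English keep their Arabic part, while rows with no Arabic letters are dropped entirely.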

Upvotes: 1
