Reputation: 71
I applied all the preprocessing steps, but I want to delete the rows that contain English words or specific symbols. I only want Arabic words, without the symbols or English words targeted in the code below. I ran the code, but when I print the dataset afterwards, it has not been cleaned! I want to remove these rows, not replace their contents.
import re
import pandas as pd
lexicon = pd.read_csv(r"C:\Users\User\Python Code\data.csv")
lexicon.head(10)
#output
Vocabulary
0 [PAD]
1 [UNK]
2 [CLS]
3 [SEP]
4 [MASK]
5 !
6 #
7 $
8 %
9 &
lexicon['clean_tweet'] = lexicon.Vocabulary.str.replace(r'http\S+|www\S+', '', regex=True) #removes links (done first, before Latin letters are stripped)
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True) #removes emojis
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'@[_A-Za-z0-9]+', '', regex=True) #removes handles
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[A-Za-z0-9]+', '', regex=True) #removes english
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('#', ' ', regex=False) #removes hashtag symbol only
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'\d+', '', regex=True) #removes numbers
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('\n', ' ', regex=False) #removes new line
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('_', '', regex=False) #removes underscore
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\w\s]', '', regex=True) #removes punctuation
lexicon.head(10)
# Vocabulary clean_tweet
0 [PAD]
1 [UNK]
2 [CLS]
3 [SEP]
4 [MASK]
5 !
6 #
7 $
8 %
9 &
I want to remove all rows that contain these symbols or text in any other language; I only need the Arabic words. Or is there another, simpler way to detect only the Arabic words?
Note: if a row contains both Arabic words and symbols, I want to delete only the symbols and keep the Arabic words.
Upvotes: 2
Views: 581
Reputation: 521989
Going by this SO answer, a Unicode regex range for Arabic letters is:
[\u0627-\u064a]
We can try using the negated version of this character class along with str.replace:
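lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\u0627-\u064a]', '', regex=True) # regex=True is needed on pandas 2.0+, where the pattern is otherwise treated as a literal string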
If you want to spare whitespace characters or other punctuation symbols, then you could try using this regex:
[^\u0627-\u064a\s!?.-]
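To also drop the rows that end up with no Arabic at all (which is what the question asks for), one possible approach is to strip the non-Arabic characters first and then filter out rows left empty. The following is only a minimal sketch: the sample DataFrame stands in for the asker's data.csv and is made up for illustration, and `ARABIC` and `arabic_only` are hypothetical names. Note that the range above starts at U+0627, so the hamza forms U+0621 to U+0626 (such as أ and إ) fall outside it; widening the range to `\u0621-\u064a` may be worth considering if those letters matter.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's data.csv
lexicon = pd.DataFrame(
    {"Vocabulary": ["[PAD]", "!", "hello", "مرحبا", "#كتاب", "سلام world"]}
)

# Arabic letter range quoted in the answer (an assumption that it covers your data)
ARABIC = r"\u0627-\u064a"

# 1. Remove every character that is neither in the Arabic range nor whitespace
cleaned = lexicon["Vocabulary"].str.replace(rf"[^{ARABIC}\s]", "", regex=True)

# 2. Collapse runs of whitespace and trim the ends
cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

# 3. Keep only the rows where some Arabic text survived
lexicon["clean_tweet"] = cleaned
arabic_only = lexicon[lexicon["clean_tweet"].str.len() > 0].reset_index(drop=True)

print(arabic_only)
```

Rows such as `[PAD]`, `!` and `hello` become empty strings after step 1 and are removed in step 3, while a mixed row such as `#كتاب` keeps only its Arabic letters.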
Upvotes: 1