andryan86

Reputation: 11

How to remove specific strings in a list from a DataFrame column

I have a list of stopwords and want to remove those words from a column of my DataFrame.
My DataFrame looks like this:

data['review_normalized']    
['bagus']    
['sangat', 'baik']    
['oke']    
['setiap', 'mau', 'meeeting', 'lah']    

My code is:

list_stopwords = (["yang", "rt", "dengan", "nya", "di", 
                       'kalo', 'amp', 'biar', 'bikin', 'bilang', 
                       'enggak', 'karena', 'nya', 'nih', 'sih', 
                       'si', 'tau', 'tidak', 'tuh', 'untuk', 'ya', 
                       'jadi', 'jangan', 'sudah', 'aja', 'saja', 't', 
                       'nyg', 'hehe', 'pengen', 'nan', 'loh', 'rt',
                       '&amp', 'yah', 'ah', 'akh', 'deh', 'doang', 
                       'eh', 'ges', 'lah', 'lho', 'dek', 'bang', 'ges', 
                       'gan', 'aduh', 'meng'])    

# convert list to set for fast membership tests
list_stopwords = set(list_stopwords)    

# remove stopwords from the token list
def stopwords_removal(words):    
    return [word for word in words if word not in list_stopwords]    

data['review_tokens_WSW'] = data['review_normalized'].apply(stopwords_removal)     

print(data['review_tokens_WSW'].head())    

But the output is like this:

0                          [[, ', b, a, g, u, s, ', ]]    
1    [[, ', s, a, n, g, a, ', ,,  , ', b, a, i, k, ...    
2                                [[, ', o, k, e, ', ]]    
3                                [[, ', o, k, e, ', ]]    
4    [[, ', s, e, i, a, p, ', ,,  , ', m, a, u, ', ...    

I want the tokens to stay as whole words, so the output should look like this:

0    ['bagus']    
1    ['sangat', 'baik']    
2    ['oke']    
4    ['setiap', 'mau', 'meeeting']    

Upvotes: 1

Views: 37

Answers (1)

jezrael

Reputation: 862406

The column contains strings, not lists, so iterating over a value yields single characters. To convert the strings to lists, use ast.literal_eval:

import ast

# convert list to set for fast membership tests
list_stopwords = set(list_stopwords)    

# remove stopwords from the tokens after converting each string to a list
def stopwords_removal(words):    
    return [word for word in ast.literal_eval(words) if word not in list_stopwords]     

data['review_tokens_WSW'] = data['review_normalized'].apply(stopwords_removal)     

print(data['review_tokens_WSW'].head())

Output:

0                    [bagus]
1             [sangat, baik]
2                      [oke]
3    [setiap, mau, meeeting]
Name: review_tokens_WSW, dtype: object
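
To see why the original code produced single characters, here is a minimal pandas-free sketch. The names raw_cells, stopwords and to_tokens are hypothetical and only for illustration; the sample values assume each cell holds the string form of a list, as in the question. The isinstance guard is an optional addition so the function also accepts cells that are already real lists.

```python
import ast

# Hypothetical sample cells: each value is a string that only looks like a list.
raw_cells = ["['bagus']", "['sangat', 'baik']", "['setiap', 'mau', 'meeeting', 'lah']"]

# Small illustrative subset of the stopword list from the question.
stopwords = {"lah", "nya", "sih"}


def to_tokens(cell):
    """Return a list of tokens whether the cell is a list or its string form."""
    if isinstance(cell, str):
        # literal_eval safely parses the Python literal "['a', 'b']" into a list.
        return ast.literal_eval(cell)
    return list(cell)


def stopwords_removal(cell):
    """Drop every token that appears in the stopword set."""
    return [word for word in to_tokens(cell) if word not in stopwords]


# Iterating over the raw string walks it character by character,
# which is the behaviour seen in the question's output.
chars = [ch for ch in raw_cells[0] if ch not in stopwords]

# Parsing first gives whole-word tokens.
cleaned = [stopwords_removal(cell) for cell in raw_cells]

print(chars)
print(cleaned)
```

The same stopwords_removal function can be passed to data['review_normalized'].apply(...) in place of the version above.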

Upvotes: 1
