Reputation: 11
I have a list of stopwords and want to remove the words in that list from my dataframe.
My dataframe looks like this:
data['review_normalized']
['bagus']
['sangat', 'baik']
['oke']
['setiap', 'mau', 'meeeting', 'lah']
My code is:
list_stopwords = (["yang", "rt", "dengan", "nya", "di",
'kalo', 'amp', 'biar', 'bikin', 'bilang',
'enggak', 'karena', 'nya', 'nih', 'sih',
'si', 'tau', 'tidak', 'tuh', 'untuk', 'ya',
'jadi', 'jangan', 'sudah', 'aja', 'saja', 't',
'nyg', 'hehe', 'pengen', 'nan', 'loh', 'rt',
'&', 'yah', 'ah', 'akh', 'deh', 'doang',
'eh', 'ges', 'lah', 'lho', 'dek', 'bang', 'ges',
'gan', 'aduh', 'meng'])
# convert the list to a set for fast membership tests
list_stopwords = set(list_stopwords)
# remove stopwords from the token list
def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]
data['review_tokens_WSW'] = data['review_normalized'].apply(stopwords_removal)
print(data['review_tokens_WSW'].head())
But the output is like this:
0 [[, ', b, a, g, u, s, ', ]]
1 [[, ', s, a, n, g, a, ', ,, , ', b, a, i, k, ...
2 [[, ', o, k, e, ', ]]
3 [[, ', o, k, e, ', ]]
4 [[, ', s, e, i, a, p, ', ,, , ', m, a, u, ', ...
I want the result to stay separated by words, so it should look like this:
0 ['bagus']
1 ['sangat', 'baik']
2 ['oke']
4 ['setiap', 'mau', 'meeeting']
Upvotes: 1
Views: 37
Reputation: 862406
The column contains strings rather than lists, so iterating over each value yields single characters. To convert the strings to lists, use ast.literal_eval:
import ast
# convert the list to a set for fast membership tests
list_stopwords = set(list_stopwords)
# remove stopwords from strings converted to token lists
def stopwords_removal(words):
    return [word for word in ast.literal_eval(words) if word not in list_stopwords]
data['review_tokens_WSW'] = data['review_normalized'].apply(stopwords_removal)
print(data['review_tokens_WSW'].head())
0 [bagus]
1 [sangat, baik]
2 [oke]
3 [setiap, mau, meeeting]
Name: review_tokens_WSW, dtype: object
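If it is unclear whether every cell holds a string or a real list, a defensive variant may help. The following is only a sketch using the standard library: the sample values in cells, the small stopwords subset, and the helper names to_tokens and remove_stopwords are made up for illustration and are not taken from the question's data.

```python
import ast

# Hypothetical sample values mimicking cells of data['review_normalized']:
# each cell is the string representation of a list, not a list itself.
cells = ["['bagus']", "['sangat', 'baik']", "['setiap', 'mau', 'meeeting', 'lah']"]

# Small illustrative subset of the stopword set (assumption).
stopwords = {"lah", "nya", "sih"}


def to_tokens(cell):
    """Return a list of tokens whether the cell is a list or its string form."""
    if isinstance(cell, str):
        # Safely parse a Python literal such as "['a', 'b']" into a list.
        return ast.literal_eval(cell)
    return list(cell)


def remove_stopwords(cell):
    """Drop every token that appears in the stopword set."""
    return [word for word in to_tokens(cell) if word not in stopwords]


# Iterating over the raw string walks it character by character,
# which is what produced output like [[, ', b, a, g, u, s, ', ]].
chars = [ch for ch in cells[0] if ch not in stopwords]

cleaned = [remove_stopwords(cell) for cell in cells]
print(chars)
print(cleaned)
```

With the dataframe from the question, the same helper could presumably be applied as data['review_normalized'].apply(remove_stopwords). One common cause of list-like strings, though not confirmed here, is saving the column to CSV and reading it back.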
Upvotes: 1