Reputation: 2754
I have been trying to perform sentiment analysis on a movie-reviews dataset, and I am stuck at the point of removing English stopwords from the data. What am I doing wrong?
from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
Upvotes: 1
Views: 1386
Reputation: 4315
You are looping over dataset, but appending the whole frame on every iteration and never using file_. The loop is unnecessary; try:
from nltk.corpus import stopwords
stop = stopwords.words("english")
dataset['Cleaned'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
That returns a Series containing a list of words per row; if you want to flatten those lists into a single list:
flat_list = [item for sublist in dataset['Cleaned'] for item in sublist]
With a hat tip to Making a flat list out of list of lists in Python
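As a self-contained sketch of the whole pipeline (the sample rows are invented, and a tiny hard-coded stop set stands in for NLTK's stopwords.words("english") so nothing needs to be downloaded):

```python
import pandas as pd

# Stand-in for stopwords.words("english"); a set makes membership tests fast
stop = {"i", "is", "it", "the", "this"}

# Hypothetical data shaped like the question's comma-separated Content column
dataset = pd.DataFrame({"Content": ["this,movie,is,great", "i,hated,it"]})

# Keep only non-stopwords, producing a list of words per row
dataset["Cleaned"] = dataset["Content"].apply(
    lambda x: [w for w in x.split(",") if w not in stop]
)

# Flatten the per-row lists into one list of words
flat_list = [w for row in dataset["Cleaned"] for w in row]
print(flat_list)  # ['movie', 'great', 'hated']
```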
Upvotes: 0
Reputation: 18218
I think the code should work with the information given so far. The assumption I am making is that the data has extra spaces after the commas. Below is the test I ran (hope it helps!):
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')
dataset = pd.DataFrame([{'Content': 'i, am, the, computer, machine'},
                        {'Content': 'i, play, game'}])
print(dataset)
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
print(dataset)
Input with stopwords:
Content
0 i, am, the, computer, machine
1 i, play, game
Output:
Content
0 [computer, machine]
1 [play, game]
Upvotes: 1
Reputation: 122168
Try earthy:
>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)
See https://github.com/alvations/earthy/blob/master/FAQ.md#what-else-can-earthy-do
Upvotes: 0
Reputation:
Well, from your comment I think you don't need to loop over dataset at all. (Maybe dataset contains only the single column named Content.)
You can simply do:
dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item.strip() for item in x if item.strip() not in stop])
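For example (sample row invented, and a small hard-coded stop set standing in for NLTK's list; note that .str.split(",") leaves leading spaces on each item, so strip before comparing):

```python
import pandas as pd

# Stand-in for stopwords.words("english")
stop = {"i", "am", "the"}

dataset = pd.DataFrame({"Content": ["i, am, the, computer, machine"]})

# Split on commas, strip the leftover spaces, drop stopwords
dataset["Content"] = dataset["Content"].str.split(",").apply(
    lambda x: [item.strip() for item in x if item.strip() not in stop]
)
print(dataset["Content"][0])  # ['computer', 'machine']
```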
Upvotes: 0