ykombinator

Reputation: 2754

Unable to remove English stopwords from a dataframe

I have been trying to perform sentiment analysis on a movie reviews dataset, and I am stuck at the point where I cannot remove English stopwords from the data. What am I doing wrong?

from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

Upvotes: 1

Views: 1386

Answers (4)

tvashtar

Reputation: 4315

You are looping over dataset, but you append the whole frame on every iteration and never use file_. Try:

from nltk.corpus import stopwords
# The NLTK stopwords file id is lowercase
stop = stopwords.words("english")
# Keep each row as a list of the words that are not stopwords
dataset['Cleaned'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])

That returns a Series containing lists of words. If you want to flatten that into a single list:

flat_list = [item for sublist in list(dataset['Cleaned'].values) for item in sublist]

With a hat tip to Making a flat list out of list of lists in Python
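
As a rough sketch of the same flattening step, itertools.chain.from_iterable from the standard library does the job without a nested comprehension. The cleaned_rows list below is made-up sample data standing in for the values of dataset['Cleaned']; it is an assumption for illustration, not the asker's data.

```python
from itertools import chain

# Hypothetical stand-in for list(dataset['Cleaned']): one list of tokens per review
cleaned_rows = [["computer", "machine"], ["play", "game"]]

# chain.from_iterable walks each inner list in order and yields its items
flat_list = list(chain.from_iterable(cleaned_rows))

print(flat_list)
```

With a real frame, passing dataset['Cleaned'] directly to chain.from_iterable should behave the same way, provided every cell holds a list.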

Upvotes: 0

niraj

Reputation: 18218

I think the code should work, given the information so far. My assumption is that the data has an extra space after each comma, so each item needs to be stripped before it is compared against the stopword list. Below is the test I ran (hope it helps!):

import pandas as pd
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus: nltk.download('stopwords')
stop = stopwords.words('english')

# Sample data with a space after each comma (DataFrame.append was removed in pandas 2.0)
dataset = pd.DataFrame({'Content': ['i, am, the, computer, machine', 'i, play, game']})
print(dataset)
list_ = []
for file_ in dataset:  # iterates over column names, so this runs once here
    # strip() drops the space after each comma so items can match the stopword list
    dataset['Content'] = dataset['Content'].apply(lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)

print(dataset)

Input with stopwords:

                          Content
0   i, am, the, computer, machine
1                   i, play, game

Output:

                Content
 0  [computer, machine]
 1         [play, game]
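
The stripping idea can also be pulled out into a small helper, sketched here in plain Python so it runs without NLTK data. The STOP set is a tiny hand-written stand-in for stopwords.words('english'), and the name remove_stopwords and the lowercasing step are assumptions added for illustration, not part of the test above.

```python
# Hypothetical mini stopword set; in practice this would be set(stopwords.words('english'))
STOP = frozenset({"i", "am", "the", "a", "an", "is"})

def remove_stopwords(text, stop=STOP):
    """Split a comma-separated string and drop stopwords and empty items."""
    # Strip surrounding whitespace so ' am' is compared as 'am'
    tokens = (tok.strip() for tok in text.split(","))
    # Lowercase only for the comparison; the original casing is kept in the result
    return [tok for tok in tokens if tok and tok.lower() not in stop]

print(remove_stopwords("i, am, the, computer, machine"))
print(remove_stopwords("I,play , game"))
```

A function like this could be passed to dataset['Content'].apply(remove_stopwords). Using a set rather than a list for the stopwords keeps each membership check cheap on large datasets.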

Upvotes: 1

alvas

Reputation: 122168

Try earthy:

>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)

See https://github.com/alvations/earthy/blob/master/FAQ.md#what-else-can-earthy-do

Upvotes: 0

user4280261

Judging from your comment, I don't think you need to loop over dataset at all. (Perhaps dataset contains only the single column named Content.)

You can simply do:

 dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item for item in x if item not in stop])

Upvotes: 0
