Reputation: 969
I have been trying to remove stopwords from a CSV file that I'm reading with Python, but my code does not seem to work. I tried a sample text in place of the file to validate the code, and the result is the same. My code is below; I would appreciate any help fixing it.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import csv
article = ['The computer code has a little bug' ,
'im learning python' ,
'thanks for helping me' ,
'this is trouble' ,
'this is a sample sentence'
'cat in the hat']
tokenized_models = [word_tokenize(str(i)) for i in article]
stopset = set(stopwords.words('english'))
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
print('token:'+str(stop_models))
Upvotes: 0
Views: 3834
Reputation: 5114
Your tokenized_models is a list of tokenized sentences, i.e. a list of lists. As a result, the following line compares each whole sentence (a list of words converted to a string) against the stopword set, so nothing is ever filtered out:
stop_models = [i for i in tokenized_models if str(i).lower() not in stopset]
Instead, iterate again through words. Something like:
clean_models = []
for m in tokenized_models:
    stop_m = [i for i in m if str(i).lower() not in stopset]
    clean_models.append(stop_m)
print(clean_models)
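As a self-contained sketch of the nested filtering above: the snippet below swaps in str.split() for word_tokenize and a tiny hand-written stop set for stopwords.words('english'), so it runs without downloading any NLTK data. Those stand-ins, and the names remove_stopwords and STOPSET, are assumptions for illustration and not part of the original code.

```python
# Stand-in for set(stopwords.words('english')); an assumption for illustration only.
STOPSET = {"the", "has", "a", "for", "me", "this", "is", "in"}


def remove_stopwords(sentences, stopset):
    """Return one filtered word list per input sentence."""
    cleaned = []
    for sentence in sentences:
        # str.split() is a simplified stand-in for nltk's word_tokenize.
        words = sentence.split()
        # Inner loop: compare each word, not the whole sentence, to the stop set.
        cleaned.append([w for w in words if w.lower() not in stopset])
    return cleaned


article = ["The computer code has a little bug", "thanks for helping me"]
clean_models = remove_stopwords(article, STOPSET)
print(clean_models)
```

The result keeps one inner list per sentence, which is handy if you later need to write each row back to the CSV.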
Off-topic useful hint:
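To define a single string across multiple lines, use parentheses and no commas. Adjacent literals are joined with nothing in between, so keep a trailing space in each piece:
article = ('The computer code has a little bug '
           'im learning python '
           'thanks for helping me '
           'this is trouble '
           'this is a sample sentence '
           'cat in the hat')
With this version, tokenize the whole string once with word_tokenize(article) rather than looping over article; the original filtering line then works on the resulting flat list of words.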
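Implicit concatenation of adjacent string literals also applies inside a list, which matters here: the article list in the question has no comma after 'this is a sample sentence', so that item and 'cat in the hat' are merged into one. A small sketch of both effects, with variable names chosen only for illustration:

```python
# Adjacent string literals are joined with no separator between them.
joined = ('this is a sample sentence'
          'cat in the hat')
print(joined)

# Inside a list, a missing comma silently merges two intended items into one.
sentences = ['this is trouble',
             'this is a sample sentence'
             'cat in the hat']
print(len(sentences))
```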
Upvotes: 3
Reputation: 36033
word_tokenize(str(i)) returns a list of words, so tokenized_models is a list of lists. You need to flatten that list, or better yet just make article a single string, since I don't see why it's a list at the moment.
This is because the in operator won't search through a list and then through the strings in that list at the same time, e.g.:
>>> 'a' in 'abc'
True
>>> 'a' in ['abc']
False
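The flattening option mentioned above could look like the following sketch. It uses itertools.chain.from_iterable from the standard library. The hard-coded tokenized_models and the small stopset are stand-ins (assumptions) for the NLTK results, so the example runs on its own.

```python
from itertools import chain

# Stand-ins for the output of word_tokenize and stopwords.words('english').
tokenized_models = [["this", "is", "trouble"], ["cat", "in", "the", "hat"]]
stopset = {"this", "is", "in", "the"}

# Flatten the list of lists into a single list of words.
flat_words = list(chain.from_iterable(tokenized_models))

# Each item is now a single word, so the membership test behaves as intended.
stop_models = [w for w in flat_words if w.lower() not in stopset]
print(stop_models)
```

Flattening loses the sentence boundaries, so it fits best when only the overall word list matters.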
Upvotes: 0