Reputation: 804
I am trying to remove words in the NLTK stopwords collection from a pandas DataFrame that consists of rows of text data, in Python 3:
import pandas as pd
from nltk.corpus import stopwords
file_path = '/users/rashid/desktop/webtext.csv'
doc = pd.read_csv(file_path, encoding = "ISO-8859-1")
texts = doc['text']
filter = texts != ""
dfNew = texts[filter]
stop = stopwords.words('english')
dfNew.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
I am getting this error:
'float' object has no attribute 'split'
Upvotes: 2
Views: 1499
Reputation: 50220
Sounds like you have some numbers in your texts, and they are causing pandas to get a little too smart. Add the dtype option to pandas.read_csv() to ensure that everything in the column 'text' is imported as a string:
doc = pd.read_csv(file_path, encoding = "ISO-8859-1", dtype={'text':str})
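For context, pandas represents missing or empty cells as NaN, which is a plain Python float, so string methods like .split() fail on it. A minimal stand-alone illustration of the failure mode (no pandas needed):

```python
# pandas fills empty/missing cells with NaN, which is a Python float,
# so string methods like .split() raise AttributeError on it.
cell = float("nan")  # stand-in for a missing value in the 'text' column
print(isinstance(cell, float))  # True

try:
    cell.split()
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'split'
```

Forcing dtype=str at read time avoids this by keeping every cell a string.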
Once you get your code working, you might notice it is slow: looking things up in a list is inefficient. Put your stopwords in a set like this, and you'll be amazed at the speedup. (The 'in' operator works with both sets and lists, but the difference in speed is huge.)
stop = set(stopwords.words('english'))
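The filtering comprehension itself does not change; only the container type does. A rough sketch with a hypothetical mini stopword list standing in for stopwords.words('english'):

```python
# Hypothetical small stopword list standing in for stopwords.words('english')
stop_list = ['a', 'an', 'the', 'is', 'in', 'of']
stop_set = set(stop_list)  # O(1) average-case membership test instead of O(n)

sentence = 'the cat is in the hat'
# The same comprehension works with either container; only lookup cost differs.
filtered = ' '.join(w for w in sentence.split() if w not in stop_set)
print(filtered)  # cat hat
```

With NLTK's ~180 English stopwords checked against every word of every row, the per-lookup saving adds up quickly.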
Finally, change x.split() to nltk.word_tokenize(x). If your data contains real text, this will separate punctuation from words and allow you to match stopwords properly.
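To see why whitespace splitting misses stopwords next to punctuation, here is a sketch using a crude regex tokenizer from the standard library as a stand-in for nltk.word_tokenize (which needs the punkt data downloaded); the stopword set here is a hypothetical mini version of NLTK's:

```python
import re

text = 'Is this the cat? No, it is not.'
stop = {'is', 'this', 'the', 'it', 'no', 'not'}  # hypothetical mini stopword set

# Whitespace splitting keeps punctuation glued to words, so 'cat?' and 'No,'
# and 'not.' never match entries in the stopword set.
kept_split = [w for w in text.split() if w.lower() not in stop]
print(kept_split)  # ['cat?', 'No,', 'not.']

# A crude word/punctuation tokenizer (stand-in for nltk.word_tokenize)
# separates the punctuation, so the stopwords now match and drop out.
tokens = re.findall(r"\w+|[^\w\s]", text)
kept_tokens = [w for w in tokens if w.lower() not in stop]
print(kept_tokens)  # ['cat', '?', ',', '.']
```

With proper tokenization, only the genuine content words survive the filter; with plain split(), stopwords hiding behind commas and periods slip through.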
Upvotes: 3