Reputation: 804
I am trying to remove words in the NLTK stopwords collection from a pandas DataFrame that consists of rows of text data, in Python 3:
import pandas as pd
from nltk.corpus import stopwords
file_path = '/users/rashid/desktop/webtext.csv'
doc = pd.read_csv(file_path, encoding = "ISO-8859-1")
texts = doc['text']
filter = texts != ""
dfNew = texts[filter]
stop = stopwords.words('english')
dfNew.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
I am getting this error:
'float' object has no attribute 'split'
Upvotes: 2
Views: 1499
Reputation: 50220
Sounds like you have some numbers in your texts, and they are causing pandas to get a little too smart. Add the dtype option to pandas.read_csv() to ensure that everything in the column 'text' is imported as a string:
doc = pd.read_csv(file_path, encoding = "ISO-8859-1", dtype={'text':str})
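For context, pandas represents missing or empty cells as NaN, which is a plain Python float, so string methods like .split() fail on it. A minimal stand-alone illustration of the failure mode (no pandas needed):

```python
# pandas fills empty/missing cells with NaN, which is a Python float,
# so string methods like .split() raise AttributeError on it.
cell = float("nan")  # stand-in for a missing value in the 'text' column
print(isinstance(cell, float))  # True

try:
    cell.split()
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'split'
```

Forcing dtype=str at read time avoids this by keeping every cell a string.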
Once you get your code working, you might notice it is slow: looking things up in a list is inefficient. Put your stopwords in a set like this, and you'll be amazed at the speedup. (The 'in' operator works with both sets and lists, but the difference in speed is huge.)
stop = set(stopwords.words('english'))
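The filtering comprehension itself does not change; only the container type does. A rough sketch with a hypothetical mini stopword list standing in for stopwords.words('english'):

```python
# Hypothetical small stopword list standing in for stopwords.words('english')
stop_list = ['a', 'an', 'the', 'is', 'in', 'of']
stop_set = set(stop_list)  # O(1) average-case membership test instead of O(n)

sentence = 'the cat is in the hat'
# The same comprehension works with either container; only lookup cost differs.
filtered = ' '.join(w for w in sentence.split() if w not in stop_set)
print(filtered)  # cat hat
```

With NLTK's ~180 English stopwords checked against every word of every row, the per-lookup saving adds up quickly.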
Finally, change x.split() to nltk.word_tokenize(x). If your data contains real text, this will separate punctuation from words and allow you to match stopwords properly.
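To see why whitespace splitting misses stopwords next to punctuation, here is a sketch using a crude regex tokenizer from the standard library as a stand-in for nltk.word_tokenize (which needs the punkt data downloaded); the stopword set here is a hypothetical mini version of NLTK's:

```python
import re

text = 'Is this the cat? No, it is not.'
stop = {'is', 'this', 'the', 'it', 'no', 'not'}  # hypothetical mini stopword set

# Whitespace splitting keeps punctuation glued to words, so 'cat?' and 'No,'
# and 'not.' never match entries in the stopword set.
kept_split = [w for w in text.split() if w.lower() not in stop]
print(kept_split)  # ['cat?', 'No,', 'not.']

# A crude word/punctuation tokenizer (stand-in for nltk.word_tokenize)
# separates the punctuation, so the stopwords now match and drop out.
tokens = re.findall(r"\w+|[^\w\s]", text)
kept_tokens = [w for w in tokens if w.lower() not in stop]
print(kept_tokens)  # ['cat', '?', ',', '.']
```

With proper tokenization, only the genuine content words survive the filter; with plain split(), stopwords hiding behind commas and periods slip through.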
Upvotes: 3