Unicode Warning when using NLTK stopwords with TfidfVectorizer of scikit-learn

Question

I am trying to use TfidfVectorizer from scikit-learn with the Spanish stop words from NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))

The problem is that I get the following warning:

/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]

Is there an easy way to solve this issue?
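
A possible workaround, offered as a sketch and not a confirmed diagnosis: on Python 2 this warning typically appears when a unicode token is compared with a byte string that contains non-ASCII bytes. That would be the case if stopwords.words("spanish") returns UTF-8 encoded byte strings (an assumption about the installed NLTK version) while the vectorizer's tokens are unicode. Decoding the stop words before passing them in should avoid the mixed comparison. The helper name to_unicode and the sample list below are hypothetical and stand in for the real NLTK output.

```python
# -*- coding: utf-8 -*-
# Sketch: normalize a stop-word list to text (unicode) strings before
# handing it to TfidfVectorizer. Assumes any byte strings are UTF-8 encoded.

def to_unicode(words, encoding="utf-8"):
    """Decode byte strings to text; leave text strings untouched."""
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]

# Hypothetical sample standing in for stopwords.words("spanish")
raw_stop_words = [b"de", b"la", b"tambi\xc3\xa9n", b"est\xc3\xa1"]
stop_words = to_unicode(raw_stop_words)

# With NLTK and scikit-learn installed, the helper would be used like this:
# vectorizer = TfidfVectorizer(stop_words=to_unicode(stopwords.words("spanish")))
```

If the list is already unicode, the helper returns it unchanged, so it is safe to apply either way.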
