user1491915

Reputation: 1085

NLTK SklearnClassifier error

I'm trying to classify text documents using NLTK's SklearnClassifier and MultinomialNB. This is the code:

pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classifier = SklearnClassifier(pipeline)

test_skl = []
t_test_skl = []
for d in test_set:
    test_skl.append(d[0])
    t_test_skl.append(d[1])

p_class = classifier.batch_classify(test_skl)

print classification_report(t_test_skl, p_class, labels=list(set(t_test_skl)),target_names=cls_set)

And I'm getting this error:

Traceback (most recent call last):
  File "classify.py", line 72, in <module>
    p_class = classifier.batch_classify(test_skl)
  File "/Users/me/anaconda/lib/python2.7/site-packages/nltk-3.0a3-py2.7.egg/nltk/classify/scikitlearn.py", line 84, in batch_classify
    X = self._vectorizer.transform(featuresets)
  File "/Users/me/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 213, in transform
    vocab = self.vocabulary_
AttributeError: 'DictVectorizer' object has no attribute 'vocabulary_'

I'm using NLTK 3.0a3 and scikit-learn 0.14.1.

Any clues?

Thanks

Upvotes: 2

Views: 1824

Answers (3)

Fred Foo

Reputation: 363517

You haven't trained the classifier. Call its train method before attempting to classify anything. (As the author of this code, I admit the error message could be friendlier.)
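A minimal sketch of the fix, assuming a small hypothetical train_set and test_set of (featureset dict, label) pairs; the toy data and the reduced k are illustrative and not from the question. The only essential change is the call to train before classifying. The batch method is looked up by name because NLTK 3.0a3 calls it batch_classify, while later releases renamed it classify_many.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: (featureset dict, label) pairs in NLTK's usual format.
train_set = [
    ({"goal": 2, "match": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"vote": 2, "party": 1}, "politics"),
    ({"election": 1, "vote": 1}, "politics"),
]
test_set = [
    ({"goal": 1, "team": 1}, "sports"),
    ({"vote": 1, "party": 1}, "politics"),
]

pipeline = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("chi2", SelectKBest(chi2, k=3)),  # k kept small to fit the toy vocabulary
    ("nb", MultinomialNB()),
])
classifier = SklearnClassifier(pipeline)

# The step missing from the question: this fits the internal DictVectorizer
# and the scikit-learn pipeline on the labeled training data.
classifier.train(train_set)

# Split the test set into featuresets and true labels.
test_skl = [features for features, _ in test_set]
t_test_skl = [label for _, label in test_set]

# Use classify_many if this NLTK version has it, else fall back to batch_classify.
classify_batch = getattr(classifier, "classify_many", None) or classifier.batch_classify
p_class = classify_batch(test_skl)

print(list(zip(t_test_skl, p_class)))
```

With the classifier trained, p_class can be passed to classification_report exactly as in the question.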

Upvotes: 3

Abhishek Thakur

Reputation: 17005

Change the pipeline to:

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])

Then it should work.

Upvotes: 0

wonderkid2

Reputation: 4864

Your DictVectorizer object has no vocabulary - meaning it hasn't been fitted, or it has been fitted with an empty dataset.

You need to call the fit(X[, y]) method on the DictVectorizer with a usable dataset.

The vocabulary_ attribute is where the vectorizer stores the mapping from feature names to column indices after it has been fitted. No vocabulary, no usable vectorizer.
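To see this outside NLTK, here is a small sketch using scikit-learn's DictVectorizer directly. The featuresets list is hypothetical data made up for illustration. It shows that vocabulary_ does not exist until fit has been called, which is the condition behind the AttributeError in the traceback.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical featuresets, in the dict form NLTK passes to the vectorizer.
featuresets = [{"goal": 2, "match": 1}, {"vote": 1, "party": 1}]

vectorizer = DictVectorizer()

# Before fitting, the attribute is simply absent.
fitted_before = hasattr(vectorizer, "vocabulary_")
print(fitted_before)

# Fitting learns the feature names and assigns each one a column index.
vectorizer.fit(featuresets)
print(vectorizer.vocabulary_)

# Only now can transform build the feature matrix.
X = vectorizer.transform(featuresets)
print(X.shape)
```

When you go through SklearnClassifier you normally do not call fit on the vectorizer yourself; training the classifier is what fits it.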

Upvotes: 1
