Reputation: 13
I'm trying to do sentiment analysis of tweets about a new movie using NLTK. I've followed the NLTK 'movie_reviews' example and built my own CategorizedPlaintextCorpusReader object. The problem arises when I call nltk.classify.util.accuracy(classifier, testfeats). Here is the code:
import os
import glob
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
trainfeats = negfeats + posfeats
# Building a custom Corpus Reader
tweets = nltk.corpus.reader.CategorizedPlaintextCorpusReader('./tweets', r'.*\.txt', cat_pattern=r'(.*)\.txt')
tweetsids = tweets.fileids()
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
print 'Training the classifier'
classifier = NaiveBayesClassifier.train(trainfeats)
for tweet in tweetsids:
    print tweet + ' : ' + classifier.classify(word_feats(tweets.words(tweetsids)))
classifier.show_most_informative_features()
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
It all seems to work fine until it gets to the last line. That's when I get the error:
>>> nltk.classify.util.accuracy(classifier, testfeats)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
Does anybody see anything wrong within the code?
Thanks.
Upvotes: 1
Views: 4455
Reputation: 879181
The error message
  File "/usr/lib/python2.7/dist-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs,l) in gold])
ValueError: too many values to unpack
arises because the items in gold cannot be unpacked into a 2-tuple, (fs,l):
[fs for (fs,l) in gold] # <-- The ValueError is raised here
It is the same error you would get if gold equals [(1,2,3)], since the 3-tuple (1,2,3) cannot be unpacked into a 2-tuple (fs,l):
In [74]: [fs for (fs,l) in [(1,2)]]
Out[74]: [1]
In [73]: [fs for (fs,l) in [(1,2,3)]]
ValueError: too many values to unpack
gold might be buried inside the implementation of nltk.classify.util.accuracy, but this hints that one of your inputs, classifier or testfeats, is of the wrong "shape".
There is no problem with classifier, since calling accuracy(classifier, trainfeats) works:
In [61]: print 'accuracy:', nltk.classify.util.accuracy(classifier, trainfeats)
accuracy: 0.9675
The problem must be in testfeats.
Compare trainfeats with testfeats.
trainfeats[0] is a 2-tuple containing a dict and a classification:
In [63]: trainfeats[0]
Out[63]:
({u'!': True,
  u'"': True,
  u'&': True,
  ...
  u'years': True,
  u'you': True,
  u'your': True},
 'neg') # <--- Notice the classification, 'neg'
but testfeats[0] is just a dict, word_feats(tweets.words(fileids=[f])):
testfeats = [(word_feats(tweets.words(fileids=[f]))) for f in tweetsids]
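To see why a bare dict produces this particular message: unpacking a dict iterates over its keys, so any dict with more than two keys has too many values for (fs, l). A small sketch, assuming a made-up feature dict shaped like the output of word_feats (the words and the try_unpack helper are illustrative only, not part of NLTK):

```python
# Hypothetical feature dict, shaped like the output of word_feats
feats = {'great': True, 'movie': True, 'loved': True}

def try_unpack(item):
    """Attempt the same 2-tuple unpacking that accuracy() performs on each item."""
    try:
        fs, l = item
    except ValueError as err:
        return 'ValueError: %s' % err
    return 'ok'

print(try_unpack((feats, 'pos')))  # a (dict, label) pair unpacks into fs and l
print(try_unpack(feats))           # a bare dict with three keys does not
```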
So to fix this you would need to define testfeats to look more like trainfeats: each dict returned by word_feats must be paired with a classification.
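A minimal sketch of the corrected shape, assuming each tweet already has a known 'pos' or 'neg' label. The labelled_tweets list below is hypothetical stand-in data; in the question, the words would come from tweets.words(fileids=[f]) and the label from wherever the true sentiment of each file is recorded.

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# Hypothetical hand-labelled tweets (token list, label); stand-in data only
labelled_tweets = [
    (['loved', 'this', 'movie'], 'pos'),
    (['what', 'a', 'waste', 'of', 'time'], 'neg'),
]

# Each test item is now a (featureset, label) 2-tuple, just like trainfeats
testfeats = [(word_feats(words), label) for (words, label) in labelled_tweets]

# The same unpacking that accuracy() performs now succeeds
featuresets = [fs for (fs, l) in testfeats]
labels = [l for (fs, l) in testfeats]
print(labels)
```

With testfeats in this shape, accuracy(classifier, testfeats) has a gold label to compare each prediction against. If the tweets are unlabelled, accuracy cannot be computed; the most you can do is call classifier.classify on each featureset.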
Upvotes: 2