Reputation: 31
I would like to classify the word series in each row of a column. I have defined a function that returns a dictionary from each series, plus the positive and negative dictionaries and a train_set. But when I try to define the classifier, the code crashes at that point.
I have this code:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
def word_feats(words, val):
    return {word: val for word in words}
voc_pos = [ 'beauty', 'good', 'happy']
voc_neg = [ 'bad', 'sick','lazy']
feat = {}
pos_feats = word_feats(voc_pos, 'pos')
neg_feats = word_feats(voc_neg, 'neg')
train_set = {**pos_feats, **neg_feats}
classifier = NaiveBayesClassifier.train(train_set)
Full error traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ion/.local/lib/python3.6/site-packages/nltk/classify/naivebayes.py", line 206, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)
Upvotes: 2
Views: 106
Reputation: 19885
The reason is pretty simple: NaiveBayesClassifier expects an iterable of 2-tuples, each comprising a featureset and a label.
For example, in your context, the positive word featureset would be something like this:
[({'beauty': 0.2}, 'pos'),
({'good': 0.3}, 'pos'),
({'happy': 0.4}, 'pos')]
Accordingly, the data you should be feeding to NaiveBayesClassifier should be of this form:
labelled_featuresets = [({'beauty': 0.2}, 'pos'),
({'good': 0.3}, 'pos'),
({'happy': 0.4}, 'pos'),
({'bad': 0.5}, 'neg'),
({'sick': 0.3}, 'neg'),
({'lazy': 0.2}, 'neg')]
classifier = NaiveBayesClassifier.train(labelled_featuresets)
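Concretely, you can build that structure straight from your two word lists. A minimal sketch (the 1.0 feature value is an arbitrary placeholder, since your original code has no real scores to use):

```python
voc_pos = ['beauty', 'good', 'happy']
voc_neg = ['bad', 'sick', 'lazy']

def labelled_feats(words, label, val=1.0):
    # One (featureset, label) 2-tuple per word; val is a placeholder score.
    return [({word: val}, label) for word in words]

# A list of (featureset, label) pairs: the shape
# NaiveBayesClassifier.train expects.
train_set = labelled_feats(voc_pos, 'pos') + labelled_feats(voc_neg, 'neg')
# train_set[0] is ({'beauty': 1.0}, 'pos'), and the whole list can be
# passed to nltk.classify.NaiveBayesClassifier.train(train_set).
```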
However, if you look at the wider context of what you're doing, I am not sure this really makes sense, for a few reasons.
The principal one is that you actually don't have a way to decide what those scores are in the first place. You seem to be doing sentiment analysis; the simplest and most common way is to download a pre-trained mapping from words to sentiment scores, so you could try that.
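To make the lexicon idea concrete, here is a toy version of that approach. The word scores below are invented for illustration only; in practice you would load a pre-trained lexicon (nltk ships one with its VADER sentiment module, for example):

```python
# Made-up scores standing in for a real pre-trained sentiment lexicon.
lexicon = {'beauty': 0.7, 'good': 0.6, 'happy': 0.8,
           'bad': -0.6, 'sick': -0.5, 'lazy': -0.4}

def sentence_score(sentence):
    # Sum the scores of known words; unknown words contribute 0.
    return sum(lexicon.get(w, 0.0) for w in sentence.lower().split())

print(sentence_score('a good and happy day') > 0)  # positive overall
print(sentence_score('bad and lazy') < 0)          # negative overall
```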
The second is that a featureset is meant as a mapping from feature names to feature values, paired with a label. If you look at the nltk official example, the labelled featuresets look something like this:
[({'last_letter': 't'}, 'female'),
({'last_letter': 'a'}, 'female'),
({'last_letter': 'h'}, 'female'),
({'last_letter': 'l'}, 'female'),
({'last_letter': 'a'}, 'female'),
({'last_letter': 'a'}, 'female'),
({'last_letter': 'e'}, 'female'),
({'last_letter': 'r'}, 'male'),
({'last_letter': 'a'}, 'male'),
({'last_letter': 'n'}, 'female')]
The workflow here takes a name, generates a single feature from it (the last letter), and then uses the last letter of each name, in conjunction with whether it is male or female (the label) to determine the conditional probability of a name's gender given its last letter.
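That workflow needs nothing beyond plain Python to set up. A sketch, using a small hand-made sample of names in place of the nltk names corpus:

```python
# Feature extractor from the nltk gender-classification example:
# each name is reduced to a single feature, its last letter.
def gender_features(name):
    return {'last_letter': name[-1].lower()}

# A tiny made-up sample standing in for the nltk names corpus.
names = [('Margaret', 'female'), ('Anna', 'female'),
         ('Robert', 'male'), ('John', 'male')]

labelled_featuresets = [(gender_features(n), g) for n, g in names]
# e.g. [({'last_letter': 't'}, 'female'), ({'last_letter': 'a'}, 'female'), ...]
# These (featureset, label) pairs are what NaiveBayesClassifier.train consumes.
```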
On the other hand, what you're doing is attempting to decide if a sentence is positive or negative, which means that you need (simplifying here) to tell if each individual word is positive or negative. However, if so, then both your feature and your label mean the exact same thing!
Upvotes: 3