bella

Reputation: 31

"Too many values to unpack" ValueError while training classifier

I would like to classify the word series in each row of a column. I have defined a function that returns a dictionary from each series, built the positive and negative dictionaries, and merged them into train_set. But when I start to train the classifier, the code crashes at that point.

I have this code:

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

def word_feats(words, val): 
    return {word: val for word in words}

voc_pos = [ 'beauty', 'good', 'happy']
voc_neg = [ 'bad', 'sick','lazy']

feat = {}
pos_feats = word_feats(voc_pos, 'pos') 
neg_feats = word_feats(voc_neg, 'neg')
train_set = {**pos_feats, **neg_feats}

classifier = NaiveBayesClassifier.train(train_set) 

Full error traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ion/.local/lib/python3.6/site-packages/nltk/classify/naivebayes.py", line 206, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack (expected 2)

Upvotes: 2

Views: 106

Answers (1)

gmds

Reputation: 19885

The reason is pretty simple: NaiveBayesClassifier.train expects an iterable of 2-tuples, each comprising a featureset and a label. You are passing it a single merged dictionary instead, so iterating over it yields bare word strings, and a string like 'beauty' cannot be unpacked into a (featureset, label) pair.

For example, in your context, the positive word featureset would be something like this:

[({'beauty': 0.2}, 'pos'),
 ({'good': 0.3}, 'pos'),
 ({'happy': 0.4}, 'pos')]

Accordingly, the data you feed to NaiveBayesClassifier.train should be of this form:

labelled_featuresets = [({'beauty': 0.2}, 'pos'),
                        ({'good': 0.3}, 'pos'),
                        ({'happy': 0.4}, 'pos'),
                        ({'bad': 0.5}, 'neg'),
                        ({'sick': 0.3}, 'neg'),
                        ({'lazy': 0.2}, 'neg')]

classifier = NaiveBayesClassifier.train(labelled_featuresets)
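Applied to the vocabularies from the question, a minimal fix could look like the sketch below (using True as a placeholder feature value, since no real scores exist yet):

```python
voc_pos = ['beauty', 'good', 'happy']
voc_neg = ['bad', 'sick', 'lazy']

def word_feats(words, label):
    # one (featureset, label) 2-tuple per word, as train() expects
    return [({word: True}, label) for word in words]

# a list of 2-tuples, not a single merged dictionary
train_set = word_feats(voc_pos, 'pos') + word_feats(voc_neg, 'neg')

# this now unpacks cleanly inside NaiveBayesClassifier.train:
# from nltk.classify import NaiveBayesClassifier
# classifier = NaiveBayesClassifier.train(train_set)
```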

However, if you look at the wider context of what you're doing, I am not sure this really makes sense, for a few reasons.

The principal one is that you actually don't have a way to decide what those scores are in the first place. You seem to be doing sentiment analysis; the simplest and most common way is to download a pre-trained mapping from words to sentiment scores, so you could try that.
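As a sketch of that lexicon-based approach (the words and scores below are invented for illustration; real pre-trained lexicons such as VADER are downloadable through NLTK):

```python
# hypothetical word-to-score lexicon; a real one would be downloaded, not hand-written
lexicon = {'beauty': 0.8, 'good': 0.6, 'happy': 0.7,
           'bad': -0.7, 'sick': -0.5, 'lazy': -0.4}

def sentence_score(sentence):
    # sum the scores of known words; unknown words count as neutral (0.0)
    return sum(lexicon.get(word, 0.0) for word in sentence.lower().split())

sentence_score('what a good and happy day')  # positive total
sentence_score('a bad lazy afternoon')       # negative total
```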

The second is that a featureset is meant to be a mapping from feature names to feature values, which is then paired with a label. If you look at the nltk official example, the labelled featuresets look something like this:

[({'last_letter': 't'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'h'}, 'female'),
 ({'last_letter': 'l'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'a'}, 'male'),
 ({'last_letter': 'n'}, 'female')]

The workflow here takes a name, generates a single feature from it (the last letter), and then uses that feature, together with the label (male or female), to estimate the conditional probability of a name's gender given its last letter.
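That workflow can be reproduced as a short sketch (the names and labels here are toy data; the actual NLTK book example derives them from the nltk.corpus.names corpus):

```python
def gender_features(name):
    # the single feature from the NLTK book example: the name's last letter
    return {'last_letter': name[-1]}

# toy labelled data standing in for the names corpus
names = [('Anna', 'female'), ('Peter', 'male'), ('Sarah', 'female')]

# again a list of (featureset, label) 2-tuples, ready for train()
featuresets = [(gender_features(name), label) for name, label in names]
```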

On the other hand, what you're doing is attempting to decide if a sentence is positive or negative, which means that you need (simplifying here) to tell if each individual word is positive or negative. However, if so, then both your feature and your label mean the exact same thing!

Upvotes: 3
