Reputation: 525
I am doing sentiment classification using the NLTK NaiveBayesClassifier. I trained and tested the model on labeled data. Now I want to predict the sentiment of data that is not labeled, but I run into an error. The line that raises the error is:
score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
The error is :
ValueError: not enough values to unpack (expected 2, got 1)
Below is the code:
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
new_data = pd.read_csv("Japan Data.csv", header=0)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats
TRAINING_COUNT = 350
def clean_text(text):
    text = text.replace("<br />", " ")
    return text
analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([(word_tokenize(unidecode(clean_text(instance))))
for instance in train_x[:TRAINING_COUNT]])
print("Vocabulary: ", len(vocabulary))
print("Computing Unigram Features ...")
unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)
print("Unigram Features: ", len(unigram_features))
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)
# Build the training set
_train_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
for instance in train_x[:TRAINING_COUNT]], labeled=False)
# Build the test set
_test_X = analyzer.apply_features([(word_tokenize(unidecode(clean_text(instance))))
for instance in test_x], labeled=False)
trainer = NaiveBayesClassifier.train
classifier = analyzer.train(trainer, zip(_train_X, train_y[:TRAINING_COUNT]))
score = analyzer.evaluate(list(zip(_test_X, test_y)))
print("Accuracy: ", score['Accuracy'])
score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
print(score_1)
I understand that the problem arises because I have to pass two values in the line that raises the error, but I don't know how to do this.
Thanks in advance.
Upvotes: 0
Views: 302
Reputation: 1364
Documentation and example
The line that gives you the error calls the method SentimentAnalyzer.evaluate(...). According to the documentation, this method does the following:
Evaluate and print classifier performance on the test set.
See SentimentAnalyzer.evaluate.
The method has one mandatory parameter: test_set.
test_set – A list of (tokens, label) tuples to use as gold set.
In the example at http://www.nltk.org/howto/sentiment.html test_set has the following structure:
[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]
Here is a symbolic representation of the structure.
[(dictionary,label), ... , (dictionary,label)]
Error in your code
You are passing
list(zip(new_data['Articles']))
to SentimentAnalyzer.evaluate. I assume you're getting the error because
list(zip(new_data['Articles']))
does not create a list of (tokens, label) tuples. With a single argument, zip yields one-element tuples such as (article,), so there is no label to unpack. You can check this by storing the list in a variable and printing it, or by inspecting the variable in a debugger. For example:
test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")
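The effect can be reproduced without NLTK. The sketch below uses two made-up strings in place of new_data['Articles'] (the sample texts are assumptions for illustration only) and shows that unpacking each element into two names fails with the same message:

```python
# Stand-in for new_data['Articles']; the texts are made up for illustration.
articles = ["first article", "second article"]

# zip with a single iterable produces one-element tuples.
test_set = list(zip(articles))
print(test_set)

# Unpacking each element into two names fails, because every tuple
# holds only one value.
message = None
try:
    for tokens, label in test_set:
        pass
except ValueError as err:
    message = str(err)
    print(message)
```

Each element would need a second value (the label) for the unpacking to succeed, and unlabeled data has none.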
You are calling evaluate correctly just above the line that raises the error:
score = analyzer.evaluate(list(zip(_test_X, test_y)))
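To predict the sentiment of unlabeled data, you probably want to call SentimentAnalyzer.classify(instance) instead. It classifies a single tokenized instance using the feature extractors already stored in the analyzer, so it needs no label. See SentimentAnalyzer.classify.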
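A minimal sketch of how that could look, assuming analyzer is the trained SentimentAnalyzer from the question. The helper name predict_sentiments, the tokenize parameter, and the 'Predicted' column name are hypothetical and not part of NLTK:

```python
def predict_sentiments(analyzer, texts, tokenize=str.split):
    """Return one predicted label per text.

    analyzer: an already trained SentimentAnalyzer (any object with a
              classify(tokens) method works here).
    tokenize: callable that turns a string into a list of tokens.
              str.split is only a placeholder; use the same tokenization
              that was applied to the training data.
    """
    return [analyzer.classify(tokenize(text)) for text in texts]


# Possible usage with the names from the question (not executed here):
# tokenize = lambda text: word_tokenize(unidecode(clean_text(text)))
# predictions = predict_sentiments(analyzer, new_data['Articles'], tokenize)
# new_data['Predicted'] = predictions  # hypothetical column name
```

Note that no accuracy can be computed for these predictions, because evaluate needs gold labels to compare against.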
Upvotes: 1