Reputation: 840
I am trying to learn (on Python3) how to do sentiment analysis for NLP and I am using the "UMICH SI650 - Sentiment Classification" Database available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I am trying to generate a vocabulary with some loops, here is the code:
import collections
import nltk
import os

Directory = "../Databases"

# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error, that I don't fully understand:
    label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried adding

if not line:
    continue

as suggested here: ValueError : not enough values to unpack. why? But it didn't work in my case. How can I solve this error?
Thanks a lot in advance,
Upvotes: 1
Views: 3995
Reputation: 122012
Here's a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11
Firstly, a context manager is your friend; use it: http://book.pythontips.com/en/latest/context_managers.html
Secondly, if it's a text file, avoid reading it in binary mode, i.e. use open(filename, 'r') rather than open(filename, 'rb'); then there's no need to juggle str/bytes with encode/decode.
And now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as cues.
        word_counts.update(map(str.lower, word_tokenize(text)))

print(word_counts)
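As a quick sanity check on the resulting vocabulary, Counter.most_common lists the highest-frequency tokens. A minimal sketch with made-up sample sentences (split on whitespace here, since the Kaggle file isn't at hand):

```python
from collections import Counter

# Build a small Counter the same way, from two made-up lines.
word_counts = Counter()
for text in ["I love this movie", "I hated this movie"]:
    word_counts.update(map(str.lower, text.split()))

# most_common(n) returns the n most frequent (word, count) pairs.
print(word_counts.most_common(3))
```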
Upvotes: 1
Reputation: 5460
The easiest way to resolve this would be to wrap the unpacking statement in a try/except
block inside your for loop (so that continue still works). Something like:
try:
    label, sentence = line.strip().split("\t".encode())
except ValueError:
    print(f'Error line: {line}')
    continue
My guess is that some of your lines have a label with nothing but whitespace afterwards.
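That guess is easy to verify in isolation: a label followed only by whitespace strips down to a single field, which is exactly what the traceback complains about. A minimal reproduction with a hypothetical sample line (not taken from the actual file):

```python
line = b"1   \n"  # hypothetical: a label, then only whitespace

fields = line.strip().split("\t".encode())
print(fields)  # [b'1'] -- just one value, so two-way unpacking fails

try:
    label, sentence = fields
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1)
```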
Upvotes: 1
Reputation: 2329
You should check for the case where you have the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields
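Run over a handful of hypothetical lines (in text mode here, so the tab separator is a plain str), the length check silently drops anything that doesn't split into exactly two fields:

```python
# Hypothetical sample lines: one blank and one malformed row mixed in.
lines = ["1\tI loved it\n", "\n", "just a label\n", "0\tIt was bad\n"]

records = []
for line in lines:
    if not line:
        continue
    fields = line.strip().split("\t")
    if len(fields) != 2:
        continue  # blank and malformed rows both land here
    records.append(tuple(fields))

print(records)  # [('1', 'I loved it'), ('0', 'It was bad')]
```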
Upvotes: 0