Reputation: 840
I am trying to learn (on Python3) how to do sentiment analysis for NLP and I am using the "UMICH SI650 - Sentiment Classification" Database available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I am trying to generate a vocabulary with some loops, here is the code:
import collections
import nltk
import os

Directory = "../Databases"

# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error, that I don't fully understand:
    label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried adding

if not line:
    continue

as suggested here: ValueError : not enough values to unpack. why? But it didn't work in my case. How can I solve this error?
Thanks a lot in advance,
Upvotes: 1
Views: 3995
Reputation: 122012
Here's a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11
Firstly, a context manager is your friend; use it: http://book.pythontips.com/en/latest/context_managers.html
Secondly, if it's a text file, avoid reading it in binary mode, i.e. use open(filename, 'r') rather than open(filename, 'rb'); then there's no need to juggle str/bytes with encode/decode.
And now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as cues.
        word_counts.update(map(str.lower, word_tokenize(text)))

print(word_counts)
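As a quick sanity check on the resulting vocabulary, Counter.most_common lists the highest-frequency tokens. A minimal sketch with made-up sample sentences (split on whitespace here, since the Kaggle file isn't at hand):

```python
from collections import Counter

# Build a small Counter the same way, from two made-up lines.
word_counts = Counter()
for text in ["I love this movie", "I hated this movie"]:
    word_counts.update(map(str.lower, text.split()))

# most_common(n) returns the n most frequent (word, count) pairs.
print(word_counts.most_common(3))
```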
Upvotes: 1
Reputation: 5460
The easiest way to resolve this would be to wrap the unpacking statement in a try/except
block inside your for loop (so that continue still works). Something like:
try:
    label, sentence = line.strip().split("\t".encode())
except ValueError:
    print(f'Error line: {line}')
    continue
My guess is that some of your lines have a label with nothing but whitespace afterwards.
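That guess is easy to verify in isolation: a label followed only by whitespace strips down to a single field, which is exactly what the traceback complains about. A minimal reproduction with a hypothetical sample line (not taken from the actual file):

```python
line = b"1   \n"  # hypothetical: a label, then only whitespace

fields = line.strip().split("\t".encode())
print(fields)  # [b'1'] -- just one value, so two-way unpacking fails

try:
    label, sentence = fields
except ValueError as err:
    print(err)  # not enough values to unpack (expected 2, got 1)
```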
Upvotes: 1
Reputation: 2329
You should check for the case where you have the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields
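Run over a handful of hypothetical lines (in text mode here, so the tab separator is a plain str), the length check silently drops anything that doesn't split into exactly two fields:

```python
# Hypothetical sample lines: one blank and one malformed row mixed in.
lines = ["1\tI loved it\n", "\n", "just a label\n", "0\tIt was bad\n"]

records = []
for line in lines:
    if not line:
        continue
    fields = line.strip().split("\t")
    if len(fields) != 2:
        continue  # blank and malformed rows both land here
    records.append(tuple(fields))

print(records)  # [('1', 'I loved it'), ('0', 'It was bad')]
```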
Upvotes: 0