Supervised machine learning with scikit-learn

Question

This is my first time doing supervised machine learning. It is a fairly advanced topic (at least for me), and I find it hard to ask a specific question because I'm not sure what is going wrong.

# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]

# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english') 
train_features = vectorizer.fit_transform([r[0] for r in train]) 
test_features = vectorizer.fit([r[0] for r in test])

# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])

# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]

Here I get the error:

float() argument must be a string or a number, not 'CountVectorizer'

This confuses me, since the original ratings that I zipped together with the reviews are integers:

type(ratings_new[0])
int

Miriam Farber · Accepted Answer

You should change the line

test_features = vectorizer.fit([r[0] for r in test])

to:

test_features = vectorizer.transform([r[0] for r in test])

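The reason is that you already fitted the vectorizer on your training data, so you should not fit it again on the test data; you only need to transform the test data with it. Note also that vectorizer.fit returns the vectorizer itself rather than a feature matrix, so test_features ended up holding a CountVectorizer object. That is why nb.predict complained about 'CountVectorizer' instead of anything related to your ratings.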
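
For reference, here is a minimal sketch of the whole pipeline with that one change applied. It reuses the toy train and test lists from the question. The names train_texts, train_labels, test_texts, returned and accuracy are introduced here only for illustration, and the accuracy_score call is an assumption about what the imported metrics module was meant for.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy data copied from the question: (review text, rating) pairs
train = [('this hostel was nice', 2), ('i hate this hostel', 1)]
test = [('had a wonderful time', 2), ('terrible experience', 1)]

# Split texts and labels (illustrative helper names, not from the question)
train_texts = [text for text, _ in train]
train_labels = [label for _, label in train]
test_texts = [text for text, _ in test]
actual = [label for _, label in test]

# Learn the vocabulary from the training reviews only
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform(train_texts)

# Reuse that vocabulary for the test reviews: transform, not fit
test_features = vectorizer.transform(test_texts)

# fit() returns the vectorizer object itself, which is what caused the error
returned = CountVectorizer(stop_words='english').fit(test_texts)

# Fit a naive Bayes model and predict on the test features
nb = MultinomialNB()
nb.fit(train_features, train_labels)
predictions = nb.predict(test_features)

# Assumed use of the metrics import: fraction of correct predictions
accuracy = metrics.accuracy_score(actual, predictions)
print(type(returned).__name__, test_features.shape, predictions, accuracy)
```

Keep in mind that with these four toy reviews none of the test words occur in the training vocabulary, so the test rows are all zeros and the predictions are not meaningful. With your real data the same code should give you a feature matrix with one column per training vocabulary word.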
