Reputation: 87
I am trying to run a classifier on some movie review data. The data had already been separated into reviews_train.txt and reviews_test.txt. I then loaded the data in and separated each into review and label (either positive (0) or negative (1)) and then vectorized this data. Here is my code:
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
#read the reviews and their polarities from a given file
def loadData(fname):
    reviews=[]
    labels=[]
    f=open(fname)
    for line in f:
        review,rating=line.strip().split('\t')
        reviews.append(review.lower())
        labels.append(int(rating))
    f.close()
    return reviews,labels
rev_train,labels_train=loadData('reviews_train.txt')
rev_test,labels_test=loadData('reviews_test.txt')
#vectorizing the input
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors_train, labels_train)
#prediction
pred=clf.predict(vectors_test)
#print accuracy
print (accuracy_score(pred,labels_test))
However I keep getting this error:
ValueError: Number of features of the model must match the input.
Model n_features is 118686 and input n_features is 34169
I am pretty new to Python so I apologize in advance if this is a simple fix.
Upvotes: 3
Views: 3775
Reputation: 104474
The problem is right here:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)
You call fit_transform on both the training and testing data. fit_transform does two things at once: it fits the model stored in vectorizer (the vocabulary and the IDF weights), then uses that model to produce the vectors. Because you call it twice, the first call fits on the training data and produces vectors_train, and the second call overwrites that model with one fitted on the test data alone. The two vocabularies differ in size, so the decision tree is trained on 118686 features but is then handed test vectors with only 34169, which is exactly the mismatch reported in the error.
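To make the mechanism concrete, here is a minimal sketch using a toy vectorizer written in plain Python. ToyCountVectorizer and the sample sentences are hypothetical and made up for illustration; this is not how scikit-learn implements TfidfVectorizer, it only mimics the fit / transform / fit_transform split to show why refitting changes the number of columns.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a text vectorizer (not scikit-learn's code)."""

    def fit(self, docs):
        # Learn the vocabulary: one column index per distinct token.
        tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        # Map documents onto the already learned vocabulary.
        # Tokens that were not seen during fit are silently dropped.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        # Fit first, then transform the same documents.
        return self.fit(docs).transform(docs)


# Made-up sample data.
train_docs = ["a great movie", "a dull boring movie", "great acting"]
test_docs = ["boring plot"]

vec = ToyCountVectorizer()
train_rows = vec.fit_transform(train_docs)

# Mistake: refitting on the test data replaces the training vocabulary.
wrong_test_rows = vec.fit_transform(test_docs)

# Fix: fit on the training data only, then just transform the test data.
vec.fit(train_docs)
right_test_rows = vec.transform(test_docs)

print("train width:", len(train_rows[0]))
print("test width after refitting:", len(wrong_test_rows[0]))
print("test width with transform only:", len(right_test_rows[0]))
```

The refitted test vectors have a different width from the training vectors, while the transform-only vectors line up with them column for column.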
When testing, you must transform the data with the same model that was used for training. Therefore, don't call fit_transform on the testing data. Just call transform instead, which reuses the already fitted model:
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.transform(rev_test) # Change here
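For completeness, here is a small end-to-end sketch of the corrected flow. The review strings and labels below are made up so the snippet runs without your files (your loadData calls would take their place), and random_state=0 is only there to make the tree reproducible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data; 0 = positive, 1 = negative, as in the question.
rev_train = [
    "a great and moving film",
    "great acting and a great story",
    "dull and boring plot",
    "a boring waste of time",
]
labels_train = [0, 0, 1, 1]
rev_test = ["great story", "boring film"]
labels_test = [0, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectors_train = vectorizer.fit_transform(rev_train)  # fit on training data only
vectors_test = vectorizer.transform(rev_test)        # reuse the fitted vocabulary

# Both matrices now have the same number of columns.
print(vectors_train.shape, vectors_test.shape)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(vectors_train, labels_train)

pred = clf.predict(vectors_test)
print(accuracy_score(labels_test, pred))
```

If the two shapes printed in the middle agree in their second dimension, predict will no longer raise the ValueError.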
Upvotes: 2