C.G

Reputation: 87

ValueError: Number of features of the model must match the input (sklearn)

I am trying to run a classifier on some movie review data. The data had already been separated into reviews_train.txt and reviews_test.txt. I then loaded the data in and separated each into review and label (either positive (0) or negative (1)) and then vectorized this data. Here is my code:

from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
#read the reviews and their polarities from a given file

def loadData(fname):
    reviews=[]
    labels=[]
    f=open(fname)
    for line in f:
        review,rating=line.strip().split('\t')  
        reviews.append(review.lower())    
        labels.append(int(rating))
    f.close()

    return reviews,labels

rev_train,labels_train=loadData('reviews_train.txt')
rev_test,labels_test=loadData('reviews_test.txt')

#vectorizing the input
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(vectors_train, labels_train)

#prediction
pred=clf.predict(vectors_test)
#print accuracy

print (accuracy_score(pred,labels_test))

However, I keep getting this error:

ValueError: Number of features of the model must match the input.
Model n_features is 118686 and input n_features is 34169 

I am pretty new to Python so I apologize in advance if this is a simple fix.

Upvotes: 3

Views: 3775

Answers (1)

rayryeng

Reputation: 104474

The problem is right here:

vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.fit_transform(rev_test)

You call fit_transform on both the training and the test data. fit_transform does two things at once: it fits the model stored in vectorizer (the vocabulary and the IDF weights) and then uses that model to produce the vectors. Because you call it twice, the first call fits the model on the training data and produces vectors_train, and the second call refits the model on the test data, overwriting it with a different vocabulary. The two matrices therefore have different numbers of columns: the decision tree was trained on 118686 features, but the test matrix has only 34169, which is exactly what the error message reports.
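To see the effect on a small scale, here is a minimal sketch that refits the same vectorizer on two different corpora. The names toy_train and toy_test and their contents are made up for illustration and are not from the question's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for rev_train / rev_test.
toy_train = ["a great movie", "a terrible movie", "great acting and a great plot"]
toy_test = ["terrible plot"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# First fit: the vocabulary is learned from the training text.
train_matrix = vectorizer.fit_transform(toy_train)
train_vocab = set(vectorizer.vocabulary_)

# Second fit: the vocabulary is relearned from the test text only,
# replacing the one learned above.
refit_test_matrix = vectorizer.fit_transform(toy_test)
refit_vocab = set(vectorizer.vocabulary_)

# The column counts differ because each matrix has one column per
# vocabulary entry, and the two vocabularies are different.
print(train_matrix.shape, refit_test_matrix.shape)
print(train_vocab == refit_vocab)
```

A classifier fitted on train_matrix would reject refit_test_matrix for the same reason as in the question: the number of columns no longer matches.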

At test time, you must transform the data with the same fitted model that was used for training. So don't call fit_transform on the test data; call transform instead, which reuses the model that was already fitted:

vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_train = vectorizer.fit_transform(rev_train)
vectors_test = vectorizer.transform(rev_test) # Change here
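One way to make this mistake harder to repeat is to wrap the vectorizer and the classifier in a scikit-learn Pipeline, so that fit only ever sees the training data and predict only ever calls transform. This is a sketch of that alternative, not part of the fix above; the toy reviews, labels, and variable names are invented for illustration, and random_state=0 is only there to make the tree reproducible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; labels follow the question's convention
# (0 = positive, 1 = negative).
toy_train = ["great movie", "terrible movie", "great acting", "terrible plot"]
toy_train_labels = [0, 1, 0, 1]
toy_test = ["great plot", "terrible acting"]
toy_test_labels = [0, 1]

# fit() calls fit_transform on the vectorizer, then fits the tree.
# predict() calls only transform on the vectorizer, so the test
# matrix always has the same columns as the training matrix.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    DecisionTreeClassifier(random_state=0),
)
model.fit(toy_train, toy_train_labels)

toy_pred = model.predict(toy_test)
toy_accuracy = accuracy_score(toy_test_labels, toy_pred)
print(toy_accuracy)
```

With the question's data, the same pattern would be model.fit(rev_train, labels_train) followed by model.predict(rev_test).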

Upvotes: 2
