from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_data['review'])
test_matrix = vectorizer.fit_transform(test_data['review'])
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression()
sentiment_model = sentiment_model.fit(train_matrix,train_data['sentiment'])
sample_test_data = test_data[10:13]
sample_test_matrix = vectorizer.fit_transform(sample_test_data['review'])
predict = sentiment_model.predict(sample_test_matrix)
Error:
ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 predict = model.predict(sample_test_matrix)

~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
        """Predicted class label per sample."""
        scores = self.decision_function(X)
        if len(scores.shape) == 1:
            indices = (scores > 0).astype(np.int)

~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
        if X.shape[1] != n_features:
            raise ValueError("X has %d features per sample; expecting %d"
                             % (X.shape[1], n_features))
        scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 85 features per sample; expecting 121676
Upvotes: 0
Views: 379
Reputation: 8187
I believe the problem you're encountering is a result of calling the fit_transform() method on your test set instead of just the transform() method. The CountVectorizer fit() method learns "a vocabulary dictionary of all tokens in the raw documents." This means that when you call fit_transform() on the training set, it produces a sparse matrix with a number of features equal to the distinct word tokens it finds in the text you give it. When you then call fit_transform() on the test set, it generates a different sparse matrix whose dimensions are based on the unique words in the test set, instead of reusing the vocabulary learned from the training data.
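Here is a minimal sketch of that mismatch, using hypothetical toy documents rather than your actual reviews: fitting the same CountVectorizer on two different corpora yields matrices with different numbers of columns, while transform() keeps the columns learned during fit().
from sklearn.feature_extraction.text import CountVectorizer
# Toy corpora (hypothetical), just to show how the column counts diverge
train_docs = ["the movie was great", "the plot was dull"]
test_docs = ["great acting"]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(train_docs).shape)  # (2, 6): vocabulary learned from train_docs
print(vectorizer.fit_transform(test_docs).shape)   # (1, 2): vocabulary re-learned from test_docs
print(vectorizer.fit(train_docs).transform(test_docs).shape)  # (1, 6): same columns as training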
You then fit the LogisticRegression object to your training data, but when you try to use predict() on the test data it complains that the input has a different number of features from what you passed in when you trained it.
EDIT: This is also happening when you call the following:
test_matrix = vectorizer.fit_transform(test_data['review'])
You should avoid creating separate fits of the CountVectorizer when you split your data into training and test sets, as the differing dimensions of the resulting sparse matrices will cause problems like the one you're experiencing.
TL;DR:
Try replacing these
test_matrix = vectorizer.fit_transform(test_data['review'])
sample_test_matrix = vectorizer.fit_transform(sample_test_data['review'])
with these
test_matrix = vectorizer.transform(test_data['review'])
sample_test_matrix = vectorizer.transform(sample_test_data['review'])
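For completeness, here is a minimal sketch of the corrected flow, reusing the variable names from your question (and assuming train_data and test_data are already loaded DataFrames):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
# Learn the vocabulary from the training reviews only
train_matrix = vectorizer.fit_transform(train_data['review'])
# Reuse that vocabulary so the test matrices have the same columns
test_matrix = vectorizer.transform(test_data['review'])

sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])

sample_test_data = test_data[10:13]
sample_test_matrix = vectorizer.transform(sample_test_data['review'])
predict = sentiment_model.predict(sample_test_matrix)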
Upvotes: 2