user9300361

Error in prediction of sentiment in Scikit LogisticRegression

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

train_matrix = vectorizer.fit_transform(train_data['review'])

test_matrix = vectorizer.fit_transform(test_data['review'])

Training the LogisticRegression model

from sklearn.linear_model import LogisticRegression

sentiment_model = LogisticRegression()

sentiment_model = sentiment_model.fit(train_matrix,train_data['sentiment'])

Examine sample data

sample_test_data = test_data[10:13]

sample_test_matrix = vectorizer.fit_transform(sample_test_data['review'])

predict = sentiment_model.predict(sample_test_matrix)

Error:

X has 85 features per sample; expecting 121676

ValueError                                Traceback (most recent call last)

in ()

----> 1 predict = model.predict(sample_test_matrix)

~\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)

            Predicted class label per sample.

----------> scores = self.decision_function(X)

       if len(scores.shape) == 1:

         indices = (scores > 0).astype(np.int)

decision_function(self, X)

      if X.shape[1] != n_features:

          raise ValueError("X has %d features per sample; expecting %d"
        ------------>   % (X.shape[1], n_features))

    scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 85 features per sample; expecting 121676

Upvotes: 0

Views: 379

Answers (1)

Mihai Chelaru

Reputation: 8187

I believe the problem you're encountering is a result of calling fit_transform() on your test set instead of just transform(). The CountVectorizer fit method learns "a vocabulary dictionary of all tokens in the raw documents."

This means that when you call fit_transform() on the training set, it produces a sparse matrix whose number of features equals the number of distinct word tokens it finds in the text you give it. When you call fit_transform() again on the test set, it re-learns the vocabulary and generates a different sparse matrix whose dimensions are based on the unique words in the test set, instead of reusing the vocabulary learned from the training data.
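To make the mismatch concrete, here is a minimal sketch on toy data. The review strings are made up for illustration and are not from the question; a fresh CountVectorizer is used for the refit so the training vocabulary stays intact.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, standing in for train_data['review'] / test_data['review'].
train_reviews = ["great product, works well", "terrible product, broke quickly"]
test_reviews = ["works great"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training text (7 distinct tokens here).
train_matrix = vectorizer.fit_transform(train_reviews)

# transform reuses the training vocabulary, so the column count matches.
reused = vectorizer.transform(test_reviews)

# fit_transform on the test text learns a new, smaller vocabulary (2 tokens here).
refit = CountVectorizer().fit_transform(test_reviews)

print(train_matrix.shape)  # (2, 7)
print(reused.shape)        # (1, 7)
print(refit.shape)         # (1, 2)
```

A model trained on train_matrix would accept reused but reject refit, which is the same kind of failure as the 85 versus 121676 features in the question.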

You then fit the LogisticRegression object to your training data, but when you call predict() on the test data, it complains that the input has a different number of features (85) from what it saw during training (121676).

EDIT: This is also happening when you call the following:

test_matrix = vectorizer.fit_transform(test_data['review'])

You should avoid refitting the CountVectorizer if you plan on splitting your data into training and test sets, as the mismatched dimensions of the resulting sparse matrices will cause problems like the one you're experiencing.
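One way to make an accidental refit harder is to bundle the vectorizer and the classifier into a scikit-learn Pipeline. This is an alternative to the approach in the question, not something the original code uses, and the toy reviews and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the question uses DataFrame columns instead.
train_reviews = [
    "great product, works well",
    "terrible product, broke quickly",
    "really great value",
    "terrible quality, do not buy",
]
train_sentiment = [1, -1, 1, -1]
test_reviews = ["works great", "broke after a week"]

# fit() fits the vectorizer and the model together on the training text only.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(train_reviews, train_sentiment)

# predict() only calls transform() on the vectorizer, never fit_transform().
predictions = pipeline.predict(test_reviews)
print(predictions)
```

With this layout the raw text goes straight into predict(), so there is no separate test matrix to build with the wrong method.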

TL;DR:

Try replacing these

test_matrix = vectorizer.fit_transform(test_data['review'])
sample_test_matrix = vectorizer.fit_transform(sample_test_data['review'])

with these

test_matrix = vectorizer.transform(test_data['review'])
sample_test_matrix = vectorizer.transform(sample_test_data['review'])
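Put together, the corrected flow might look like the sketch below. It keeps the variable names from the question but uses small made-up lists in place of the train_data and test_data DataFrames, which are not shown in the post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for train_data['review'], train_data['sentiment']
# and a slice of test_data['review'].
train_reviews = [
    "great product, works well",
    "terrible product, broke quickly",
    "really great value",
    "terrible quality, do not buy",
]
train_sentiment = [1, -1, 1, -1]
sample_test_reviews = ["works great", "broke quickly", "great value"]

vectorizer = CountVectorizer()

# Fit the vocabulary once, on the training text only.
train_matrix = vectorizer.fit_transform(train_reviews)

sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_sentiment)

# Reuse the fitted vocabulary for anything passed to predict().
sample_test_matrix = vectorizer.transform(sample_test_reviews)
predict = sentiment_model.predict(sample_test_matrix)

print(sample_test_matrix.shape[1] == train_matrix.shape[1])  # True
print(predict)
```

Words in the test text that never appeared in the training text are simply ignored by transform(), which is why the column count stays fixed.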

Upvotes: 2
