oog1119

Reputation: 27

Dimensions different after performing the same preprocessing with CountVectorizer

I am confused about why my training and test sparse matrices have a different number of features after I apply the same preprocessing to both.

This is preventing me from making predictions on my test data.

def vectorizer(X):
    vectorizer = CountVectorizer(stop_words = 'english')
    vectorizer.fit(X)
    X = vectorizer.fit_transform(X)
    
    return X

other_features = ["n_steps", "n_ingredients"]
features = df_train[other_features]
test_features = df_test[other_features]

name = vectorizer(df_train.name)
steps = vectorizer(df_train.steps)
ingr = vectorizer(df_train.ingredients)


test_name = vectorizer(df_test.name)
test_steps = vectorizer(df_test.steps)
test_ingr = vectorizer(df_test.ingredients)

X = hstack([steps,ingr, name, np.array(features)])
X_test = hstack([test_steps, test_ingr, test_name, np.array(test_features)])
clf = LogisticRegression(C = 0.01, max_iter = 1000000, penalty = 'l2')
clf.fit(X, y)
predictions = clf.predict(X_test)

The error raised during prediction:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-1534a274c605> in <module>
----> 1 predictions = clf.predict(X_test)

/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in predict(self, X)
    305             Predicted class label per sample.
    306         """
--> 307         scores = self.decision_function(X)
    308         if len(scores.shape) == 1:
    309             indices = (scores > 0).astype(np.int)

/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
    284         n_features = self.coef_.shape[1]
    285         if X.shape[1] != n_features:
--> 286             raise ValueError("X has %d features per sample; expecting %d"
    287                              % (X.shape[1], n_features))
    288 

ValueError: X has 16417 features per sample; expecting 31765

Upvotes: 1

Views: 845

Answers (2)

user3252344

Reputation: 758

Vectorize your testing data with the vectorizer you fit to your training data. Otherwise the input encoding isn't the same, so anything your model learned on the first dataset is irrelevant.

As to why the lengths differ: when you fit the vectorizer, every unique (non-excluded) word in that dataset becomes one column, so the vocabulary size sets the length of the vector. Any word that appears in only one of the two datasets gets a column in one matrix but not in the other. Your training and test sets don't contain the same set of unique words, so the two matrices end up with different widths.
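A minimal sketch of that effect, assuming scikit-learn is installed. The two tiny corpora are made up for illustration and are not the asker's data. It compares fitting a separate CountVectorizer per dataset with fitting one on the training text and reusing it:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, for illustration only
train_docs = ["chicken soup with rice", "beef stew with potatoes", "rice pudding"]
test_docs = ["chicken curry", "potato salad"]

# Fitting separately: each vectorizer learns its own vocabulary,
# so the column counts depend on the words in each dataset.
separate_train = CountVectorizer().fit_transform(train_docs)
separate_test = CountVectorizer().fit_transform(test_docs)
print(separate_train.shape, separate_test.shape)

# Fitting once on the training text and reusing the vectorizer:
# the test matrix is encoded against the training vocabulary.
shared = CountVectorizer()
shared_train = shared.fit_transform(train_docs)
shared_test = shared.transform(test_docs)
print(shared_train.shape, shared_test.shape)
```

With the shared vectorizer, words that only occur in the test text (here "curry", "potato", "salad") are simply ignored, which is why the column count stays fixed.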

So, tl;dr: fit once, use twice.


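from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
count.fit(X_train)                      # learn the vocabulary from the training text only
X_train_vec = count.transform(X_train)
X_test_vec = count.transform(X_test)    # reuse the same vocabulary for the test text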

If you want to automate that you can put it in a pipeline.


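from sklearn import metrics
from sklearn.pipeline import Pipeline

# yourmodel is a placeholder for any estimator, e.g. LogisticRegression()
pipeline = Pipeline([('count', CountVectorizer()),
                     ('model', yourmodel)])

# Fit on the training data (Y_train_labels: your training labels), then predict on the test data
pipeline.fit(X_train, Y_train_labels)
Y_test = pipeline.predict(X_test)

test_acc = metrics.accuracy_score(Y_test_labels, Y_test) # Or whatever metric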

Upvotes: 1

shadowtalker

Reputation: 13853

You must use the same CountVectorizer instance for both datasets. When you use the .fit_transform method, the transformer internally stores the transformation that it learned, so that it can be re-applied later.

In your code, not only are you creating a new CountVectorizer instance for every data set, you are also freshly training it on every data set. If you think of "test" data as an approximation of "out of sample" data (i.e. data that you do not currently have), you should see why doing it your way makes no sense, which is why you are getting nonsense results.

Correct usage would look something like this:

vectorizer = CountVectorizer(stop_words = 'english')
clf = LogisticRegression(C = 0.01, max_iter = 1000000, penalty = 'l2')

x_train = vectorizer.fit_transform(data_train)
clf.fit(x_train, y_train)
pred_train = clf.predict(x_train)

x_test = vectorizer.transform(data_test)
pred_test = clf.predict(x_test)

Note that in your case you might also want to use Pipeline and ColumnTransformer.
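A rough sketch of what that could look like for the columns in the question, assuming pandas and scikit-learn are available. The toy DataFrames and labels below are invented for illustration. The column names come from the question, and each text column gets its own CountVectorizer that is fitted only when the pipeline is fitted on the training data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data that mimics the question's column layout
df_train = pd.DataFrame({
    "name": ["chicken soup", "beef stew", "rice pudding", "apple pie"],
    "steps": ["boil chicken add rice", "brown beef simmer slowly",
              "cook rice add milk", "bake apples in crust"],
    "ingredients": ["chicken rice salt", "beef potato onion",
                    "rice milk sugar", "apple flour butter"],
    "n_steps": [4, 5, 3, 6],
    "n_ingredients": [3, 3, 3, 3],
})
y_train = [0, 0, 1, 1]  # made-up labels

df_test = pd.DataFrame({
    "name": ["lamb curry", "rice salad"],
    "steps": ["brown lamb add spices", "cook rice chop vegetables"],
    "ingredients": ["lamb onion spices", "rice tomato cucumber"],
    "n_steps": [5, 2],
    "n_ingredients": [3, 3],
})

# A text column is selected with a plain string (not a list) so that
# CountVectorizer receives a 1-D sequence of documents.
features = ColumnTransformer([
    ("name", CountVectorizer(stop_words="english"), "name"),
    ("steps", CountVectorizer(stop_words="english"), "steps"),
    ("ingr", CountVectorizer(stop_words="english"), "ingredients"),
    ("numeric", "passthrough", ["n_steps", "n_ingredients"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(C=0.01, max_iter=1000)),
])

model.fit(df_train, y_train)          # vocabularies are learned here, from training data only
predictions = model.predict(df_test)  # test data is encoded with those same vocabularies
```

Because the vectorizers live inside the pipeline, there is no separate fit step to accidentally repeat on the test set, and the manual hstack call is no longer needed.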

I recommend reading this guide for more information.

Upvotes: 2
