Reputation: 27
I am confused about how my test and training sparse matrices end up with a different number of features after the same preprocessing. This is preventing me from predicting on my test data.
def vectorizer(X):
    vectorizer = CountVectorizer(stop_words = 'english')
    vectorizer.fit(X)
    X = vectorizer.fit_transform(X)
    return X
other_features = ["n_steps", "n_ingredients"]
features = df_train[other_features]
test_features = df_test[other_features]
name = vectorizer(df_train.name)
steps = vectorizer(df_train.steps)
ingr = vectorizer(df_train.ingredients)
test_name = vectorizer(df_test.name)
test_steps = vectorizer(df_test.steps)
test_ingr = vectorizer(df_test.ingredients)
X = hstack([steps,ingr, name, np.array(features)])
X_test = hstack([test_steps, test_ingr, test_name, np.array(test_features)])
clf = LogisticRegression(C = 0.01, max_iter = 1000000, penalty = 'l2')
clf.fit(X, y)
predictions = clf.predict(X_test)
The error raised during prediction:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-1534a274c605> in <module>
----> 1 predictions = clf.predict(X_test)
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in predict(self, X)
305 Predicted class label per sample.
306 """
--> 307 scores = self.decision_function(X)
308 if len(scores.shape) == 1:
309 indices = (scores > 0).astype(np.int)
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
284 n_features = self.coef_.shape[1]
285 if X.shape[1] != n_features:
--> 286 raise ValueError("X has %d features per sample; expecting %d"
287 % (X.shape[1], n_features))
288
ValueError: X has 16417 features per sample; expecting 31765
Upvotes: 1
Views: 845
Reputation: 758
Vectorize your testing data with the vectorizer you fit to your training data. Otherwise the input encoding isn't the same, so anything your model learned on the first dataset is irrelevant.
As to why the lengths differ: when you fit the vectorizer, the number of unique (non-excluded) words in that data becomes the length of the feature vector. A word that appears in only one of the two datasets shows up in one vocabulary but not the other, so the two fits learn different numbers of unique words and produce vectors of different lengths.
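Here is a minimal sketch of that mismatch (toy sentences, not your data):
from sklearn.feature_extraction.text import CountVectorizer

train = ["bake the cake", "mix the flour"]
test = ["bake the bread"]

# Two separate fits learn two different vocabularies:
print(len(CountVectorizer().fit(train).vocabulary_))  # 5: bake, cake, flour, mix, the
print(len(CountVectorizer().fit(test).vocabulary_))   # 3: bake, bread, the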
So, tl;dr: fit once, transform twice.
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
count.fit(X_train)                      # learn the vocabulary from training data only
X_train_vec = count.transform(X_train)  # encode the training data
X_test_vec = count.transform(X_test)    # encode the test data with the same vocabulary
If you want to automate that you can put it in a pipeline.
from sklearn.pipeline import Pipeline
from sklearn import metrics

pipeline = Pipeline([('count', CountVectorizer()),
                     ('model', your_model)])  # e.g. LogisticRegression()

pipeline.fit(X_train, y_train)
pred_test = pipeline.predict(X_test)           # the pipeline vectorizes, then predicts
test_acc = metrics.accuracy_score(y_test, pred_test)  # or whatever metric
Upvotes: 1
Reputation: 13853
You must use the same CountVectorizer instance for both datasets. When you use the .fit_transform method, the transformer internally stores the transformation that it learned, so that it can be re-applied later.
In your code, not only are you creating a new CountVectorizer instance for every dataset, you are also freshly training it on every dataset. If you think of "test" data as an approximation of "out of sample" data (i.e. data that you do not currently have), you should see why fitting on it makes no sense: each vectorizer learns a different vocabulary, which is why the feature counts do not match.
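A small illustration of what gets stored (toy data, not from the question):
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["red green", "green blue"])
print(sorted(vec.vocabulary_.items()))  # [('blue', 0), ('green', 1), ('red', 2)]

# transform() reuses that stored mapping; unseen words are simply dropped
print(vec.transform(["green red purple"]).toarray())  # [[0 1 1]]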
Correct usage would look something like this:
vectorizer = CountVectorizer(stop_words = 'english')
clf = LogisticRegression(C = 0.01, max_iter = 1000000, penalty = 'l2')

x_train = vectorizer.fit_transform(data_train)  # fit on the training data only
clf.fit(x_train, y_train)
pred_train = clf.predict(x_train)

x_test = vectorizer.transform(data_test)        # reuse the fitted vocabulary
pred_test = clf.predict(x_test)
Note that in your case you might also want to use a Pipeline and a ColumnTransformer, since you are combining several text columns with numeric features.
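A hedged sketch of what that could look like, reusing the column names and model settings from your code (df_train, df_test and y are assumed to exist as in the question):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

cols = ['name', 'steps', 'ingredients', 'n_steps', 'n_ingredients']

# One CountVectorizer per text column; the numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    [('name', CountVectorizer(stop_words='english'), 'name'),
     ('steps', CountVectorizer(stop_words='english'), 'steps'),
     ('ingr', CountVectorizer(stop_words='english'), 'ingredients')],
    remainder='passthrough')

pipe = Pipeline([('prep', preprocess),
                 ('clf', LogisticRegression(C=0.01, max_iter=1000000, penalty='l2'))])

pipe.fit(df_train[cols], y)                # vectorizers are fit on the training data only
predictions = pipe.predict(df_test[cols])  # and automatically reused for the test data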
I recommend reading this guide for more information.
Upvotes: 2