Reputation: 914
One problem I ran into when predicting with a feature-selected data set: once I have selected certain features on the training data, the test data set no longer lines up with it, because the training data set has fewer features as a result of feature selection. How do you implement feature selection properly so that the test data set ends up with the same features as the training data set?
Example:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)  # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2)
Upvotes: 1
Views: 1765
Reputation: 95907
You have to transform your test set too, and don't use fit_transform on it, just transform. This requires you to keep your fitted SelectKBest object around, so something to the effect of:
selector = SelectKBest(chi2, k=2)
X_train_clean = selector.fit_transform(X_train, y_train)
X_test_clean = selector.transform(X_test)
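For a version that runs end to end, here is a minimal sketch on the iris data from the question. The train/test split (test_size=0.25, random_state=0) and the name kept are assumptions added for illustration; they are not part of the original snippet.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Load iris and hold out a test set (split parameters are illustrative).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the selector on the training data only, then reuse it on the test data.
selector = SelectKBest(chi2, k=2)
X_train_clean = selector.fit_transform(X_train, y_train)
X_test_clean = selector.transform(X_test)

# Column indices that the selector kept, identical for both sets.
kept = selector.get_support(indices=True)
print(kept)
print(X_train_clean.shape, X_test_clean.shape)
```

Because the same fitted selector is applied to both sets, they end up with the same two columns, and the test data never influences which features are chosen.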
Upvotes: 3
Reputation: 16079
I believe you want to create a feature_selector object by fitting SelectKBest first, and then use it to transform your training and test data. Like so:
feature_selector = SelectKBest(chi2, k=2).fit(X_train, y_train)
X_train_pruned = feature_selector.transform(X_train)
X_test_pruned = feature_selector.transform(X_test)
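If the pruned data is going straight into a model, one possible alternative is to wrap the selector and the estimator in a scikit-learn pipeline, so the fitted selector is applied automatically at prediction time. This is only a sketch: the LogisticRegression classifier, the split parameters, and the variable names are assumptions chosen for illustration, not something the question specified.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load iris and hold out a test set (split parameters are illustrative).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline fits SelectKBest on the training data, then fits the
# classifier (an arbitrary choice here) on the two selected columns.
model = make_pipeline(SelectKBest(chi2, k=2), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# predict() applies the already-fitted selector to X_test before classifying.
predictions = model.predict(X_test)
print(predictions.shape)
```

With this layout there is no separate pruned test array to keep in sync; the raw X_test goes in and the pipeline selects the same columns it chose during fit.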
Upvotes: 0