Reputation: 65
My question is: should I run scikit-learn's SelectKBest for feature selection on the whole dataset first and then partition it into training and test sets, or should I run SelectKBest on the training and test sets only after they have been partitioned? In the second case, is it possible that different k features would be selected for the test set than were used for training? I am very new to machine learning and only recently learned about feature selection.
I used the univariate feature selection example here to learn about SelectKBest: http://scikit-learn.org/stable/modules/feature_selection.html
Upvotes: 1
Views: 2422
Reputation: 3196
Technically speaking, you should fit SelectKBest on the training set and then transform the test set with the fitted selector, because you should not use your test data in any part of your training procedure.
Imagine applying the model to new data at a later stage: you will HAVE TO transform those data using the SelectKBest selector you fitted on the training data. Following the same procedure at evaluation time therefore gives you more accurate performance estimates.
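A minimal sketch of that fit-on-train / transform-test pattern (the synthetic data and k=5 are just illustrative assumptions, not from your question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for your data.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Split first, so the test set never influences feature selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)  # scores computed on train only
X_test_sel = selector.transform(X_test)                 # same columns reused on test

print(X_train_sel.shape, X_test_sel.shape)
```

Because `transform` reuses the columns chosen during `fit`, the test set is guaranteed to end up with exactly the same k features as the training set, which answers the "could different features be selected?" concern.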
If you implement a cross-validation scheme, you should repeat this procedure within each CV fold in order to get a correct estimate of the classifier's (or regressor's) performance.
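The easiest way to repeat the selection inside every fold is to wrap SelectKBest and the estimator in a `Pipeline`; `cross_val_score` then refits the selector on each fold's training portion automatically. The estimator and dataset below are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your data.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),        # refit on each fold's train split
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # no test-fold leakage into selection
print(scores.mean())
```

Doing the selection once on the full dataset before cross-validating would leak information from the held-out folds and typically inflates the estimated performance.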
Upvotes: 3