andreSmol

Reputation: 1038

Python scikits SVM grid search and classification

I am a beginner with scikit-learn and SVMs and I would like to check a couple of things. I have 700 samples with 35 features each, and 3 classes. I have an array X with my samples and features, scaled using preprocessing.scale(X). The first step is to find suitable SVM parameters, and I am using grid search with nested cross-validation (see http://scikit-learn.org/stable/auto_examples/grid_search_digits.html#). I use all my samples (X) in the grid search. During the grid search, the data is split into training and testing sets (using StratifiedKFold). Once I have my SVM parameters, I perform the classification, where I divide my data into training and testing sets. Is it ok to use the same data in the grid search that I will be using for the real classification?
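In outline, my setup looks roughly like this (a minimal sketch; the parameter grid values are just placeholders, y holds my 3 class labels, and I'm assuming a scikit-learn version where GridSearchCV lives in sklearn.model_selection):

    import numpy as np
    from sklearn import preprocessing
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    # X: (700, 35) feature matrix, y: 700 class labels in {0, 1, 2}
    X = preprocessing.scale(X)

    # placeholder grid; my real ranges differ
    param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

    search = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(n_splits=5))
    search.fit(X, y)  # grid search over all 700 samples
    print(search.best_params_)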

Upvotes: 0

Views: 2036

Answers (2)

Fred Foo

Reputation: 363567

Is it ok to use the same data in the grid search that I will be using during the real classification?

It is ok to use this data for training (fitting) a classifier. Cross-validation, as done by StratifiedKFold, is intended for situations where you don't have enough data to hold out a validation set while optimizing the hyperparameters (the algorithm settings). You can also use it if you're too lazy to make a validation set splitter and want to rely on scikit-learn's built-in cross-validation :)

The refit option to GridSearchCV will retrain the estimator on the full training set after finding the optimal settings with cross-validation.
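In recent scikit-learn versions refit=True is the default, so after fit the best estimator has already been retrained on everything you passed in (a minimal sketch; the grid values are arbitrary and X_train/y_train are whatever training data you have):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # refit=True (the default) retrains with the best settings found by CV
    search = GridSearchCV(SVC(), {"C": [1, 10, 100]}, cv=5, refit=True)
    search.fit(X_train, y_train)

    clf = search.best_estimator_  # already fitted on the full training set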

It is, however, senseless to apply a trained classifier to the data you grid searched or trained on, since you already have the labels. If you want to do formal evaluation of a classifier, you should hold out a test set from the very beginning and not touch that again until you've done all your grid searching, validation and fitting.
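Concretely, something like this (sketch only; the test_size and stratify choices are mine, and search is the grid search object from the snippet above):

    from sklearn.model_selection import train_test_split

    # hold out a test set before any grid searching
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    search.fit(X_train, y_train)         # grid search touches only the training part
    print(search.score(X_test, y_test))  # one final, honest evaluation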

Upvotes: 4

dmytro

Reputation: 1303

I'm not a machine learning expert, but as far as I know, the advantage of cross-validation is that it guards against overfitting to a single train/test split. Hence, it should be totally ok to use the classifier with the best performance (according to the CV results) for the final evaluation.

The question is, however, why do you need to do the "real classification" on the data you already have labels for? What is the final goal (SVM performance evaluation or classification)?

Upvotes: 0
