andreSmol

Reputation: 1038

Python scikits SVM grid search and classification

I am a beginner with scikit-learn and SVMs and I would like to check a couple of things. I have 700 samples with 35 features each, and 3 classes. I have an array X with my samples and features, scaled using preprocessing.scale(X). The first step is to find suitable SVM parameters, and I am using grid search with nested cross-validation (see http://scikit-learn.org/stable/auto_examples/grid_search_digits.html#). I use all my samples (X) in the grid search. During the grid search, the data is split into training and testing sets (using StratifiedKFold). Once I have my SVM parameters, I perform the classification, where I divide my data into training and testing sets. Is it ok to use the same data in the grid search that I will be using for the real classification?
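In outline, my setup looks roughly like this (a minimal sketch; the parameter grid values are just placeholders, y holds my 3 class labels, and I'm assuming a scikit-learn version where GridSearchCV lives in sklearn.model_selection):

    import numpy as np
    from sklearn import preprocessing
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    # X: (700, 35) feature matrix, y: 700 class labels in {0, 1, 2}
    X = preprocessing.scale(X)

    # placeholder grid; my real ranges differ
    param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

    search = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(n_splits=5))
    search.fit(X, y)  # grid search over all 700 samples
    print(search.best_params_)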

Upvotes: 0

Views: 2036

Answers (2)

Fred Foo

Reputation: 363567

Is it ok to use the same data in the grid search that I will be using during the real classification?

It is ok to use this data for training (fitting) a classifier. Cross-validation, as done by StratifiedKFold, is intended for situations where you don't have enough data to hold out a validation set while optimizing the hyperparameters (the algorithm settings). You can also use it if you're too lazy to make a validation set splitter and want to rely on scikit-learn's built-in cross-validation :)

The refit option to GridSearchCV will retrain the estimator on the full training set after finding the optimal settings with cross-validation.
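In recent scikit-learn versions refit=True is the default, so after fit the best estimator has already been retrained on everything you passed in (a minimal sketch; the grid values are arbitrary and X_train/y_train are whatever training data you have):

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # refit=True (the default) retrains with the best settings found by CV
    search = GridSearchCV(SVC(), {"C": [1, 10, 100]}, cv=5, refit=True)
    search.fit(X_train, y_train)

    clf = search.best_estimator_  # already fitted on the full training set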

It is, however, senseless to apply a trained classifier to the data you grid searched or trained on, since you already have the labels. If you want to do formal evaluation of a classifier, you should hold out a test set from the very beginning and not touch that again until you've done all your grid searching, validation and fitting.
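Concretely, something like this (sketch only; the test_size and stratify choices are mine, and search is the grid search object from the snippet above):

    from sklearn.model_selection import train_test_split

    # hold out a test set before any grid searching
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    search.fit(X_train, y_train)         # grid search touches only the training part
    print(search.score(X_test, y_test))  # one final, honest evaluation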

Upvotes: 4

dmytro

Reputation: 1303

I'm not a machine learning expert, but as far as I know, the advantage of cross-validation is that it guards against overfitting to a single train/test split. Hence, it should be totally ok to use the classifier with the best performance (according to the CV results) for the final evaluation.

The question is, however, why do you need to do the "real classification" on the data you already have labels for? What is the final goal (SVM performance evaluation or classification)?

Upvotes: 0
