kroonike

Reputation: 1129

Grid-search with specific validation data

I'm looking for a way to grid-search for hyperparameters in sklearn without using K-fold validation. I.e., I want the grid search to train on a specific dataset (X1, y1 in the example below) and validate itself on a specific hold-out dataset (X2, y2 in the example below).

X1,y1 = train data
X2,y2 = validation data

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

clf_ = SVC(kernel='rbf', cache_size=1000)
Cs = [1, 10.0, 50, 100.0]
Gammas = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56]
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2,y2
                   n_jobs=8, verbose=10)
clf.fit(X1, y1)
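
A minimal sketch of the behaviour I'm after, using sklearn's PredefinedSplit (continuing the snippet above; it assumes X1/X2 and y1/y2 can simply be stacked):

import numpy as np
from sklearn.model_selection import PredefinedSplit

# Mark training rows with -1 (never placed in a test fold) and
# validation rows with 0, yielding a single train/validation split.
X = np.vstack([X1, X2])
y = np.concatenate([y1, y2])
test_fold = np.concatenate([np.full(len(X1), -1), np.full(len(X2), 0)])

clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=PredefinedSplit(test_fold),
                   n_jobs=8, verbose=10)
clf.fit(X, y)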

Upvotes: 0

Views: 1967

Answers (2)

cgnorthcutt

Reputation: 4036

Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can also be used with TensorFlow, PyTorch, Caffe2, etc.

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
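
For the asker's classification setup, the same call should work with the question's model: pass SVC() as model and use X1, y1, X2, y2 as the train and validation arguments to opt.fit, since hypopt works with any scikit-learn estimator as noted above.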

Upvotes: 1

Abhishek Sharma

Reputation: 2039

clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2,y2
                   n_jobs=8, verbose=10)
  1. Setting n_jobs to a specific number > 1 does not make much sense. n_jobs=-1 means the search will use all the cores on your machine; n_jobs=1 means only one core will be used.

  2. If cv=5, it will run five-fold cross-validation for every parameter combination.

  3. In your case the total number of fits will be 4 (size of Cs) * 9 (size of Gammas) * 5 (value of cv) = 180.

    1. If you are using cross-validation, it does not make sense to also hold out data for rechecking your model. If you are not confident about the performance, you can simply increase cv to get a better estimate.

This will be very time-consuming, especially for an SVM, so I would rather suggest RandomizedSearchCV, which lets you specify how many parameter settings you want the model to sample at random (n_iter); a minimal sketch follows.
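
A sketch under the assumption that the question's C and gamma ranges are kept; the continuous distributions below are illustrative choices, not values from the original answer:

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# n_iter caps how many parameter settings are sampled at random.
search = RandomizedSearchCV(SVC(kernel='rbf', cache_size=1000),
                            param_distributions={'C': uniform(1, 99),          # C in [1, 100]
                                                 'gamma': uniform(0.4, 0.16)},  # gamma in [0.4, 0.56]
                            n_iter=20, cv=5, n_jobs=-1, verbose=10)
search.fit(X1, y1)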

Upvotes: 0
