nopact

Reputation: 315

Using hold-out-set for validation in RandomizedSearchCV in scikit-learn?

Is there any way to do RandomizedSearchCV from scikit-learn when the validation data already exists as a holdout set? I have tried to concatenate the train and validation data and define the cv parameter to split exactly where the two sets were combined, but could not find a proper syntax that is accepted by RandomizedSearchCV.

The scikit-learn documentation says:

cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:
      - None, to use the default 3-fold cross validation,
      - integer, to specify the number of folds in a `(Stratified)KFold`,
      - An object to be used as a cross-validation generator.
      - An iterable yielding train, test splits.

The last option should work, I hope, but I don't know in what format I need to pass it.
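What I have so far is something along these lines (a sketch; the array names and sizes are placeholders for my actual data). The idea is to concatenate so the holdout rows come last, then describe the single split by index:

```python
import numpy as np

# hypothetical pre-existing train and holdout arrays
X_train = np.random.rand(80, 4)
y_train = np.random.randint(0, 2, 80)
X_val = np.random.rand(20, 4)
y_val = np.random.randint(0, 2, 20)

# concatenate so that the holdout rows come last
X = np.concatenate([X_train, X_val])
y = np.concatenate([y_train, y_val])

# indices of the training block and the holdout block
train_idx = np.arange(len(X_train))        # 0 .. 79
val_idx = np.arange(len(X_train), len(X))  # 80 .. 99
```

But I don't know how to wrap these indices into whatever iterable RandomizedSearchCV expects for cv.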

Any help is appreciated!

Upvotes: 3

Views: 1854

Answers (1)

afsharov

Reputation: 5164

Suppose you have the indices of your training samples in train_indices and the indices of your test samples in test_indices. Then, it is sufficient to pass these as a tuple wrapped in a list to the cv parameter of RandomizedSearchCV. A MWE to demonstrate:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


X, y = make_classification(n_samples=10)

param_distributions = {
    'n_estimators': [10, 20, 30, 40]
}

train_indices = [0, 1, 2, 3, 4]
test_indices = [5, 6, 7, 8, 9]
cv = [(train_indices, test_indices)]

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    cv=cv,
    n_iter=2
)

search.fit(X, y)

This will always train and test the estimator on the same samples. If your data is stored in pandas DataFrames, e.g. df, use df.index.values to get the indices you need.
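An equivalent option is scikit-learn's PredefinedSplit, which encodes the same single split as a per-sample array: -1 marks samples that always stay in the training set, and a fold index (here 0) marks the test fold. A sketch using the same toy setup as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

X, y = make_classification(n_samples=10, random_state=0)

# -1: sample is always in the training set
#  0: sample belongs to the (single) test fold
test_fold = [-1] * 5 + [0] * 5
ps = PredefinedSplit(test_fold)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={'n_estimators': [10, 20, 30, 40]},
    cv=ps,
    n_iter=2,
    random_state=0,
)
search.fit(X, y)
```

This yields exactly one (train, test) split, identical to the list-of-tuples approach, and may read more naturally when the holdout membership is stored as a column alongside your data.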

Upvotes: 1
