nopact

Reputation: 315

Using hold-out-set for validation in RandomizedSearchCV in scikit-learn?

Is there any way to do RandomizedSearchCV from scikit-learn when the validation data already exists as a holdout set? I have tried to concatenate the train and validation data and define the cv parameter to split exactly where the two sets were combined, but could not find a proper syntax that is accepted by RandomizedSearchCV.

The scikit-learn documentation says:

cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:
      - None, to use the default 3-fold cross validation,
      - integer, to specify the number of folds in a `(Stratified)KFold`,
      - An object to be used as a cross-validation generator.
      - An iterable yielding train, test splits.

The last option should work, I hope, but I don't know in what format I need to pass it.
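What I have so far is something along these lines (a sketch; the array names and sizes are placeholders for my actual data). The idea is to concatenate so the holdout rows come last, then describe the single split by index:

```python
import numpy as np

# hypothetical pre-existing train and holdout arrays
X_train = np.random.rand(80, 4)
y_train = np.random.randint(0, 2, 80)
X_val = np.random.rand(20, 4)
y_val = np.random.randint(0, 2, 20)

# concatenate so that the holdout rows come last
X = np.concatenate([X_train, X_val])
y = np.concatenate([y_train, y_val])

# indices of the training block and the holdout block
train_idx = np.arange(len(X_train))        # 0 .. 79
val_idx = np.arange(len(X_train), len(X))  # 80 .. 99
```

But I don't know how to wrap these indices into whatever iterable RandomizedSearchCV expects for cv.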

Any help is appreciated!

Upvotes: 3

Views: 1854

Answers (1)

afsharov

Reputation: 5164

Suppose you have the indices of your training samples in train_indices and the indices of your test samples in test_indices. Then, it is sufficient to pass these as a tuple wrapped in a list to the cv parameter of RandomizedSearchCV. A MWE to demonstrate:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


X, y = make_classification(n_samples=10)

param_distributions = {
    'n_estimators': [10, 20, 30, 40]
}

train_indices = [0, 1, 2, 3, 4]
test_indices = [5, 6, 7, 8, 9]
cv = [(train_indices, test_indices)]

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    cv=cv,
    n_iter=2
)

search.fit(X, y)

This will always train and test the estimator on the same samples. If your data is stored in pandas DataFrames, e.g. df, use df.index.values to get the indices you need.
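An equivalent option is scikit-learn's PredefinedSplit, which encodes the same single split as a per-sample array: -1 marks samples that always stay in the training set, and a fold index (here 0) marks the test fold. A sketch using the same toy setup as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV

X, y = make_classification(n_samples=10, random_state=0)

# -1: sample is always in the training set
#  0: sample belongs to the (single) test fold
test_fold = [-1] * 5 + [0] * 5
ps = PredefinedSplit(test_fold)

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={'n_estimators': [10, 20, 30, 40]},
    cv=ps,
    n_iter=2,
    random_state=0,
)
search.fit(X, y)
```

This yields exactly one (train, test) split, identical to the list-of-tuples approach, and may read more naturally when the holdout membership is stored as a column alongside your data.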

Upvotes: 1
