Reputation: 6290
I'm using Python and I would like to use nested cross-validation with scikit-learn. I have found a very good example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Data, parameter grid and estimator, as in the scikit-learn nested CV example
X_iris, y_iris = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
svr = SVC(kernel="rbf")

NUM_TRIALS = 30
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)

for i in range(NUM_TRIALS):
    # Choose cross-validation techniques for the inner and outer loops,
    # independently of the dataset.
    # E.g. "LabelKFold", "LeaveOneOut", "LeaveOneLabelOut", etc.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
    outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

    # Non-nested parameter search and scoring
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
    clf.fit(X_iris, y_iris)
    non_nested_scores[i] = clf.best_score_

    # Nested CV with parameter optimization
    nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
    nested_scores[i] = nested_score.mean()
How can the best set of parameters, as well as all sets of parameters (with their corresponding scores), be accessed from the nested cross-validation?
Upvotes: 4
Views: 3060
Reputation: 63
Vivek Kumar's answer is based on using an explicit outer-CV for loop. If the OP wants to access the best estimator and best params within sklearn's cross-validation workflow, I'd suggest using cross_validate instead of cross_val_score, because the former can return the fitted estimator from each outer fold (via return_estimator=True). An added bonus of cross_validate is that you can specify multiple metrics.
from sklearn.model_selection import cross_validate

# "roc_auc" assumes a binary target; for the 3-class iris data use e.g. "accuracy" or "roc_auc_ovr"
scoring = {"auroc": "roc_auc"}  # [1]

# return_estimator=True keeps the fitted GridSearchCV object from each outer fold
nested_scores = cross_validate(clf, X=X_iris, y=y_iris, cv=outer_cv,
                               scoring=scoring, return_estimator=True)
Then you can access the best model from each cv fold:
best_models = nested_scores['estimator']
for i, model in enumerate(best_models):
    best_model = model.best_estimator_   # refitted best estimator for this outer fold
    best_params = model.best_params_     # winning hyper-parameters for this outer fold
    print(f"Fold {i}: {best_params}")
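The same cross_validate result also holds the per-fold outer scores; each metric in the scoring dict shows up under a test_<name> key. A minimal sketch, assuming the scoring and nested_scores names from the snippet above:
# Per-fold outer-loop scores; the key mirrors the name used in the scoring dict ("auroc" here).
# Any additional metrics would appear as extra test_<name> entries.
fold_scores = nested_scores['test_auroc']
print("outer-fold scores:", fold_scores)
print("nested CV estimate:", fold_scores.mean())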
[1] For the list of available scoring strings, see https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
Upvotes: 1
Reputation: 36599
You cannot access the individual params or the best params from cross_val_score. What cross_val_score does internally is clone the supplied estimator for each split, call fit and score on the clone with the given X and y, and then discard the clone, so only the array of scores is returned.
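As a quick illustration (using the clf, X_iris, y_iris and outer_cv objects from the question), the call hands back nothing but a NumPy array with one score per outer split, so the fitted clones, and with them best_params_, are already gone:
scores = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
print(scores)        # e.g. 4 fold scores, one per outer split
print(scores.shape)  # (4,) -- no estimators, no params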
If you want to access the params at each outer split, you can use:
# Put the code below inside your NUM_TRIALS for loop
# (define nested_scores_train and nested_scores_test as np.zeros(NUM_TRIALS) before the loop)
cv_iter = 0
temp_nested_scores_train = np.zeros(4)
temp_nested_scores_test = np.zeros(4)
for train, test in outer_cv.split(X_iris):
    clf.fit(X_iris[train], y_iris[train])
    temp_nested_scores_train[cv_iter] = clf.best_score_
    temp_nested_scores_test[cv_iter] = clf.score(X_iris[test], y_iris[test])
    # You can access the grid search's params here, e.g. clf.best_params_ and clf.cv_results_
    cv_iter += 1

nested_scores_train[i] = temp_nested_scores_train.mean()
nested_scores_test[i] = temp_nested_scores_test.mean()
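To make the in-loop comment concrete, here is a minimal sketch of what you could read off at that point; clf is the GridSearchCV instance just fitted on the current outer training split:
# Best parameter combination for this outer split and its mean inner-CV score
print(clf.best_params_)
print(clf.best_score_)

# All parameter combinations tried in the inner search, with their mean inner-CV scores
for params, mean_score in zip(clf.cv_results_['params'],
                              clf.cv_results_['mean_test_score']):
    print(params, mean_score)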
Upvotes: 7