Reputation: 3318
I have optimized a RandomForest using GridSearchCV with nested cross-validation. After that, I know that with the best parameters I have to train on the whole dataset before making predictions on out-of-sample data.
Do I have to fit the model twice? Once to find the accuracy estimate by nested cross-validation, and then again before predicting on the out-of-sample data?
Please check my code:
#Imports (inferred from usage below)
import numpy as np
import scipy.io as sio
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedShuffleSplit, GridSearchCV,
                                     cross_val_score)

#Load data
for name in ["AWA"]:
    for el in ['Fp1']:
        X = sio.loadmat('/home/TrainVal/{}_{}.mat'.format(name, el))['x']
        s_y = sio.loadmat('/home/TrainVal/{}_{}.mat'.format(name, el))['y']
        y = np.ravel(s_y)
        print(name, el, X.shape, y.shape)
        print("")
#Pipeline
clf = Pipeline([('rcl', RobustScaler()),
                ('clf', RandomForestClassifier())])
#Optimization
#Outer loop
sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
#Inner loop
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
# Use a full grid over all parameters
param_grid = {'clf__n_estimators': [10, 12, 15],
              'clf__max_features': [3, 5, 10],
              }
# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=sss_inner, n_jobs=-1)
#FIRST FIT!!!!!
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y, cv=sss_outer)
#Show the best parameters from the inner loop
print(grid_search.best_params_)
#Show the accuracy averaged over the outer folds
print(scores.mean())
#SECOND FIT!!! (X_out_of_sample/y_out_of_sample are placeholders for the held-out data)
y_score = grid_search.fit(X, y).score(X_out_of_sample, y_out_of_sample)
print(y_score)
Upvotes: 0
Views: 1883
Reputation: 161
It is like any normal model that you build. Once you have trained your model (either through CV or a normal train-test split), you use .score or .predict on the best_estimator_ from the grid search to go ahead with the prediction.
Sample code I have used recently:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import time

bootstrap = [True, False]
max_features = [3, 4, 5, 'auto']
n_estimators = [20, 75, 100]

rf_model = RandomForestClassifier(random_state=1)
param_grid = dict(bootstrap=bootstrap, max_features=max_features, n_estimators=n_estimators)
grid = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=10, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(train_iv_data, train_dv_data)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str(time.time() - start_time) + ' s')

# classification_report expects (y_true, y_pred)
print(classification_report(test_dv_data, grid_result.best_estimator_.predict(test_iv_data)))
Upvotes: 0
Reputation: 36599
There are a couple of things you need to understand.

When you do your "first fit", that fits the grid_search model according to the sss_inner cv and stores the result in grid_search.best_estimator_ (i.e. the best estimator according to the scores on the test data from the sss_inner folds).

Then you use that grid_search in cross_val_score (nesting). Your fitted model from the "first fit" is of no use here: cross_val_score will clone the estimator, call grid_search.fit() on the training folds from sss_outer (meaning the training data from sss_outer is presented to grid_search, which again splits it according to sss_inner), and report the scores on the test folds of sss_outer. The estimator you passed to cross_val_score is never fitted itself; cross_val_score only works on clones.

Now in your "second fit" you are fitting again, exactly as in the "first fit". There is no need for that, because the model is already fitted. Just call grid_search.score() on your out-of-sample data; it will internally call score() on the best_estimator_.
You can look at my answer here to learn more about nested cross validation with grid search.
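To make the workflow concrete, here is a minimal sketch. X_train/y_train and X_test/y_test are placeholder names for your training and out-of-sample data; the pipeline, grid, and splitters mirror the ones in the question:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedShuffleSplit, GridSearchCV,
                                     cross_val_score)

clf = Pipeline([('rcl', RobustScaler()),
                ('clf', RandomForestClassifier())])
param_grid = {'clf__n_estimators': [10, 12, 15],
              'clf__max_features': [3, 5, 10]}
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=sss_inner, n_jobs=-1)

# Nested CV: cross_val_score clones grid_search for every outer fold,
# so this line only produces a performance estimate, not a fitted model.
scores = cross_val_score(grid_search, X_train, y_train, cv=sss_outer)
print("Nested CV accuracy: %0.3f" % scores.mean())

# One fit on all the training data; with the default refit=True,
# GridSearchCV refits best_estimator_ on the full data with the
# best parameters found via sss_inner.
grid_search.fit(X_train, y_train)

# No second fit needed: score() delegates to best_estimator_.
print("Out-of-sample accuracy:", grid_search.score(X_test, y_test))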
Upvotes: 1
Reputation: 3550
Your grid_search.best_estimator_ contains the model refitted with the cross-validated best_params_ parameters; there is no need to refit again.
You can use:
clf = grid_search.best_estimator_
preds = clf.predict(X_unseen)
Upvotes: 1