Reputation: 3318
I have optimized a RandomForest using GridSearchCV with nested cross-validation. After that, I know that with the best parameters I have to train on the whole dataset before making predictions on out-of-sample data.
Do I have to fit the model twice? Once to find the accuracy estimate by nested cross-validation, and then again before predicting on the out-of-sample data?
Please check my code:
#Imports (inferred from usage below)
import numpy as np
import scipy.io as sio
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedShuffleSplit, GridSearchCV,
                                     cross_val_score)

#Load data
for name in ["AWA"]:
    for el in ['Fp1']:
        X = sio.loadmat('/home/TrainVal/{}_{}.mat'.format(name, el))['x']
        s_y = sio.loadmat('/home/TrainVal/{}_{}.mat'.format(name, el))['y']
        y = np.ravel(s_y)
        print(name, el, X.shape, y.shape)
        print("")
#Pipeline
clf = Pipeline([('rcl', RobustScaler()),
                ('clf', RandomForestClassifier())])
#Optimization
#Outer loop
sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
#Inner loop
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
# Use a full grid over all parameters
param_grid = {'clf__n_estimators': [10, 12, 15],
              'clf__max_features': [3, 5, 10],
              }
# Run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=sss_inner, n_jobs=-1)
#FIRST FIT!!!!!
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y, cv=sss_outer)
#Show the best parameters from the inner loop
print(grid_search.best_params_)
#Show the accuracy averaged over the outer folds
print(scores.mean())
#SECOND FIT!!! (X_out_of_sample/y_out_of_sample are placeholders for the held-out data)
y_score = grid_search.fit(X, y).score(X_out_of_sample, y_out_of_sample)
print(y_score)
Upvotes: 0
Views: 1883
Reputation: 161
It is like any normal model that you build. Once you have trained your model (either through CV or a normal train-test split), you use .score or .predict on the best_estimator_ from the grid search to go ahead with the prediction.
Sample code I have used recently:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import time

bootstrap = [True, False]
max_features = [3, 4, 5, 'auto']
n_estimators = [20, 75, 100]

rf_model = RandomForestClassifier(random_state=1)
param_grid = dict(bootstrap=bootstrap, max_features=max_features, n_estimators=n_estimators)
grid = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=10, n_jobs=-1)

start_time = time.time()
grid_result = grid.fit(train_iv_data, train_dv_data)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print("Execution time: " + str(time.time() - start_time) + ' s')

# classification_report expects (y_true, y_pred)
print(classification_report(test_dv_data, grid_result.best_estimator_.predict(test_iv_data)))
Upvotes: 0
Reputation: 36599
There are a couple of things you need to understand.

When you do your "first fit", that fits the grid_search model according to the sss_inner cv and stores the result in grid_search.best_estimator_ (i.e. the best estimator according to the scores on the test data from the sss_inner folds).

Then you use that grid_search in cross_val_score (nesting). Your fitted model from the "first fit" is of no use here: cross_val_score will clone the estimator, call grid_search.fit() on the training folds from sss_outer (meaning the training data from sss_outer is presented to grid_search, which again splits it according to sss_inner), and report the scores on the test folds of sss_outer. The estimator you passed to cross_val_score is never fitted itself; cross_val_score only works on clones.

Now in your "second fit" you are fitting again, exactly as in the "first fit". There is no need for that, because the model is already fitted. Just call grid_search.score() on your out-of-sample data; it will internally call score() on the best_estimator_.
You can look at my answer here to learn more about nested cross validation with grid search.
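To make the workflow concrete, here is a minimal sketch. X_train/y_train and X_test/y_test are placeholder names for your training and out-of-sample data; the pipeline, grid, and splitters mirror the ones in the question:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedShuffleSplit, GridSearchCV,
                                     cross_val_score)

clf = Pipeline([('rcl', RobustScaler()),
                ('clf', RandomForestClassifier())])
param_grid = {'clf__n_estimators': [10, 12, 15],
              'clf__max_features': [3, 5, 10]}
sss_inner = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
sss_outer = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=1)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=sss_inner, n_jobs=-1)

# Nested CV: cross_val_score clones grid_search for every outer fold,
# so this line only produces a performance estimate, not a fitted model.
scores = cross_val_score(grid_search, X_train, y_train, cv=sss_outer)
print("Nested CV accuracy: %0.3f" % scores.mean())

# One fit on all the training data; with the default refit=True,
# GridSearchCV refits best_estimator_ on the full data with the
# best parameters found via sss_inner.
grid_search.fit(X_train, y_train)

# No second fit needed: score() delegates to best_estimator_.
print("Out-of-sample accuracy:", grid_search.score(X_test, y_test))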
Upvotes: 1
Reputation: 3550
Your grid_search.best_estimator_ contains the model refitted with the cross-validated best_params_ parameters; there is no need to refit again.
You can use:
clf = grid_search.best_estimator_
preds = clf.predict(X_unseen)
Upvotes: 1