Christian

Reputation: 914

Hyperparameter tuning on the whole dataset?

This may be a weird question because I don't fully understand hyperparameter tuning yet.

Currently I'm using GridSearchCV from sklearn to tune the hyperparameters of a RandomForestClassifier like this:

# 'scoring' is a multi-metric dict of scorers defined earlier (not shown); refit='Accuracy' refers to one of its keys
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5), 'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_
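
Reading the result of the search afterwards comes down to two standard GridSearchCV attributes:

print(gs.best_params_)  # parameter combination with the best mean test score
print(gs.best_score_)   # mean cross-validated score of that combination, taken from the refit scorer ('Accuracy')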

After checking the gs object for best_params_ and best_score_, I use best_params_ to instantiate a RandomForestClassifier and run stratified cross-validation again to record metrics and print a confusion matrix:

# assumes skf (a StratifiedKFold instance) and score (presumably precision_recall_fscore_support) are defined earlier
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18, criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0

print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter+1

meanAcc = round(np.mean(np.asarray(metrics['accuracy'])), 2) * 100
print('meanAcc: ', meanAcc)

Is this a reasonable approach, or am I getting something completely wrong?

EDIT:

I just tested the following:

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4), 'min_samples_leaf': range(5, 40, 5), 'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)

This yields best_score_ = 0.5362903225806451 at best_index_ = 28. When I check the accuracies of the 3 folds at index 28 in cv_results_, I get:

  1. split0: 0.5185929648241207
  2. split1: 0.526686807653575
  3. split2: 0.5637651821862348

GridSearchCV averages these to the reported mean test accuracy of 0.5362903225806451. best_params_: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}
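
For reference, reading those per-split values out of cv_results_ looks roughly like this (the key names assume the scorer is registered as 'Accuracy' in my scoring dict):

i = gs.best_index_
for k in ('split0_test_Accuracy', 'split1_test_Accuracy', 'split2_test_Accuracy'):
    print(k, gs.cv_results_[k][i])
print('mean_test_Accuracy', gs.cv_results_['mean_test_Accuracy'][i])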

Now I run this code, which uses the mentioned best_params_ with a stratified 3-fold split (like GridSearchCV):

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, max_depth=21, criterion='entropy', random_state=42)
accuracy = []
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0
print('################################################### RandomForest_Gini ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred))
    metrics['accuracy'].append(accuracy_score(y_test, y_pred))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter = counter+1

meanAcc = np.mean(np.asarray(metrics['accuracy']))
print('meanAcc: ', meanAcc)

The metrics dictionary yields the exact same accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).

However, the mean computed this way is slightly different: 0.5363483182213101.
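
Just to double-check the arithmetic, the plain unweighted mean of the three split accuracies does give that second number:

import numpy as np
print(np.mean([0.5185929648241207, 0.526686807653575, 0.5637651821862348]))  # matches the 0.5363483182213101 above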

Upvotes: 2

Views: 622

Answers (1)

pilu

Reputation: 800

While this seems like a promising approach, you are taking a risk: you are tuning, and then evaluating the results of that tuning, on the same dataset.

While in some cases this is a legitimate approach, I would carefully check the difference between the metric you get at the end and the reported best_score_. If these are far apart, you should tune your model on the training set only (right now you are tuning on everything). In practice, this means performing the split beforehand and making sure that GridSearchCV never sees the test set.

This could be done like this:

# note the return order of train_test_split: features first, then labels
train_x, val_x, train_y, val_y = train_test_split(X_Distances, Y, test_size=0.3, random_state=42)

You would then run the tuning and training on train_x and train_y only, and keep val_x, val_y for the final check, roughly as in the sketch below.
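
A rough sketch of that workflow, reusing the names from your question (I'm assuming your scoring dict and param_grid; stratify=Y is an extra I would add because your own evaluation is stratified):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# hold out a validation set that the search never sees
train_x, val_x, train_y, val_y = train_test_split(X_Distances, Y, test_size=0.3,
                                                  random_state=42, stratify=Y)

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(train_x, train_y)   # tuning only ever sees the training part

# the refit best model, evaluated once on the held-out data
val_acc = accuracy_score(val_y, gs.predict(val_x))
print('cv best_score_:', gs.best_score_, ' held-out accuracy:', val_acc)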

On the other hand, if the two scores are close, I guess you are good to go.

Upvotes: 3
