Fixing Overfitting for a Random Forest/Gradient Boosting Regression through GridSearch

Question

I'm working on a regression problem using the Random Forest and Gradient Boosting algorithms in Python.

The code runs and the results look promising, but I have a problem with overfitting.

As I understand it, GridSearchCV should prevent overfitting, because it uses cross-validation to find the best parameters for the models. However, when I run the grid search and then use the optimized model to calculate the training and validation (test) errors for the dataset, the difference between them is so large that it seems to indicate overfitting.

The dataset I'm working with has 1500 instances and 5 features.

How can I fix this issue?

The current code for the Random Forest is:

from ucimlrepo import fetch_ucirepo
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
# Load the airfoil self-noise dataset from the UCI repository
airfoil_self_noise = fetch_ucirepo(id=291)
X = airfoil_self_noise.data.features
y = airfoil_self_noise.data.targets
# Hold out 20% of the data as a test set
randomstate = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=randomstate)
# Grid search with 10-fold cross-validation on the training set only
rfr = RandomForestRegressor(random_state=randomstate)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}
rfr_cv = GridSearchCV(estimator=rfr, param_grid=param_grid, cv=10, scoring='neg_root_mean_squared_error', n_jobs=4)
rfr_cv.fit(X_train, y_train.values.ravel())
print(rfr_cv.best_params_)
# best_score_ is the negated mean cross-validated RMSE, hence abs()
print(abs(rfr_cv.best_score_))
# Compare the refitted best model on the training and test sets
y_pred_train = rfr_cv.predict(X_train)
y_pred_test = rfr_cv.predict(X_test)
train_rmse = root_mean_squared_error(y_train, y_pred_train)
test_rmse = root_mean_squared_error(y_test, y_pred_test)
print(f"Test RMSE: {test_rmse}")
print(f"Train RMSE: {train_rmse}")
print(f"Difference: {test_rmse - train_rmse}")

The "Difference" is the problem, because it seems to be way too high for a model that should generalize well.
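One way to look at that gap in more detail is to ask GridSearchCV for the training score of every fold, not just the validation score. The following is only a minimal sketch under stated assumptions: it uses synthetic data from `make_friedman1` as a stand-in for the airfoil dataset, a deliberately small grid, and `return_train_score=True`, which adds `mean_train_score` to `cv_results_`. None of these choices come from the code above.

```python
import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic stand-in data with 1500 rows and 5 features
X, y = make_friedman1(n_samples=1500, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Small illustrative grid (not the grid from the question)
param_grid = {"max_depth": [2, 5, 10], "min_samples_leaf": [1, 10]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # also record the score on each training fold
)
search.fit(X_train, y_train)

# Scores are negated RMSE values, so flip the sign to get RMSE
results = pd.DataFrame(search.cv_results_)
results["train_rmse"] = -results["mean_train_score"]
results["cv_rmse"] = -results["mean_test_score"]
results["gap"] = results["cv_rmse"] - results["train_rmse"]

summary = results[
    ["param_max_depth", "param_min_samples_leaf", "train_rmse", "cv_rmse", "gap"]
].sort_values("cv_rmse")
print(summary)
```

The table lists one row per parameter combination, sorted by cross-validated RMSE. The grid search selects on `cv_rmse` alone and does not take the gap into account, so the selected combination can still have a much lower training error than validation error.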

The code for Gradient Boosting is:

from ucimlrepo import fetch_ucirepo
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import root_mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
# Load the airfoil self-noise dataset from the UCI repository
airfoil_self_noise = fetch_ucirepo(id=291)
X = airfoil_self_noise.data.features
y = airfoil_self_noise.data.targets
# Hold out 20% of the data as a test set
randomstate = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=randomstate)
GBR = GradientBoostingRegressor(random_state=randomstate)
# Custom scorer: greater_is_better=False negates the RMSE so that higher is better
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [.001, .005, .01],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10],
    'subsample': [0.5, 0.8, 1.0]
}
# Grid search with 10-fold cross-validation on the training set only
GBR2 = GridSearchCV(estimator=GBR, param_grid=param_grid, cv=10, scoring=rmse_scorer, n_jobs=4)
GBR2.fit(X_train, y_train.values.ravel())
print(GBR2.best_params_)
# best_score_ is the negated mean cross-validated RMSE, hence abs()
print(abs(GBR2.best_score_))
# Compare the refitted best model on the training and test sets
y_pred_train = GBR2.predict(X_train)
y_pred_test = GBR2.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_pred_test)
train_rmse = root_mean_squared_error(y_train, y_pred_train)
print(f"Test RMSE: {test_rmse}")
print(f"Train RMSE: {train_rmse}")
print(f"Difference: {test_rmse - train_rmse}")

For Gradient Boosting, I noticed that manually lowering the learning rate reduces the difference to a size that no longer indicates overfitting, but it results in a much higher RMSE on both the training and test data.
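The trade-off between the learning rate and the train/test gap can be made visible by tracking the RMSE after every boosting stage. This is a rough sketch, not the setup from the question: it assumes synthetic `make_friedman1` data in place of the airfoil dataset, a fixed `max_depth=3`, and adds a learning rate of 0.1 (not in the grid above) purely for contrast. `rmse_per_stage` is a hypothetical helper built on `staged_predict`.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 1500 rows and 5 features
X, y = make_friedman1(n_samples=1500, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


def rmse_per_stage(model, X, y):
    """Hypothetical helper: RMSE after each boosting stage via staged_predict."""
    return np.array(
        [np.sqrt(mean_squared_error(y, pred)) for pred in model.staged_predict(X)]
    )


curves = {}
for lr in [0.001, 0.01, 0.1]:  # 0.1 is added here only for comparison
    gbr = GradientBoostingRegressor(
        n_estimators=300, learning_rate=lr, max_depth=3, random_state=0
    )
    gbr.fit(X_train, y_train)
    curves[lr] = (
        rmse_per_stage(gbr, X_train, y_train),
        rmse_per_stage(gbr, X_test, y_test),
    )

# Report the final-stage errors and the gap for each learning rate
for lr, (train_curve, test_curve) in curves.items():
    gap = test_curve[-1] - train_curve[-1]
    print(
        f"lr={lr}: train RMSE {train_curve[-1]:.3f}, "
        f"test RMSE {test_curve[-1]:.3f}, gap {gap:.3f}"
    )
```

With a very small learning rate and a limited number of stages, the model may simply not have finished fitting, which would produce a small gap together with high errors on both sets. Plotting the two curves per learning rate (for example with matplotlib) shows whether the test RMSE is still falling at the last stage.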

I also tried adding early stopping and L1 and L2 regularization for better results, but that didn't change the results much.
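For reference, early stopping can be configured directly on scikit-learn's `GradientBoostingRegressor` through `n_iter_no_change`, `validation_fraction`, and `tol`. The sketch below is an assumption about how such a setup might look, since the exact configuration that was tried is not shown; the data is again a synthetic `make_friedman1` stand-in and all hyperparameter values are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with 1500 rows and 5 features
X, y = make_friedman1(n_samples=1500, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound; early stopping may fit fewer stages
    learning_rate=0.05,       # illustrative value
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=20,      # stop after 20 stages without improvement
    tol=1e-4,
    random_state=0,
)
gbr.fit(X_train, y_train)

# n_estimators_ is the number of stages actually fitted
print(f"Stages fitted: {gbr.n_estimators_} of {gbr.n_estimators}")
print(f"Train R^2: {gbr.score(X_train, y_train):.3f}")
print(f"Test R^2:  {gbr.score(X_test, y_test):.3f}")
```

Comparing `n_estimators_` with the requested `n_estimators` shows whether early stopping triggered at all. If it never triggers, the validation score was still improving when the stage limit was reached.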

