Tegel

Reputation: 1

Fixing Overfitting for a Random Forest/Gradient Boosting Regression through GridSearch

I'm working on a regression problem using the Random Forest and Gradient Boosting algorithms in Python.

The code runs and the results look promising, but I have a problem with overfitting.

GridSearchCV should prevent overfitting, because it uses cross-validation to find the best parameters for the models. However, when I run GridSearchCV and then use the optimized model to calculate the validation and training errors for the dataset, the difference between them becomes so large that it seems to indicate overfitting.

The dataset I'm working with has 1500 instances and 5 features.

How can I fix this issue?

The current code for the Random Forest is:

from ucimlrepo import fetch_ucirepo
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the dataset (UCI repository id 291) and split off a 20% test set
airfoil_self_noise = fetch_ucirepo(id=291)
X = airfoil_self_noise.data.features
y = airfoil_self_noise.data.targets
randomstate = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=randomstate)

# Grid search over the forest's hyperparameters with 10-fold cross-validation
rfr = RandomForestRegressor(random_state=randomstate)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [2, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}
rfr_cv = GridSearchCV(estimator=rfr, param_grid=param_grid, cv=10, scoring='neg_root_mean_squared_error', n_jobs=4)
rfr_cv.fit(X_train, y_train.values.ravel())
print(rfr_cv.best_params_)
# best_score_ is the negated RMSE, so take the absolute value
print(abs(rfr_cv.best_score_))

# Compare the refitted best model's error on the training and test sets
y_pred_train = rfr_cv.predict(X_train)
y_pred_test = rfr_cv.predict(X_test)
train_rmse = root_mean_squared_error(y_train, y_pred_train)
test_rmse = root_mean_squared_error(y_test, y_pred_test)
print(f"Test RMSE: {test_rmse}")
print(f"Train RMSE: {train_rmse}")
print(f"Difference: {test_rmse - train_rmse}")

The "Difference" is the problem, because it seems far too high for a model that should generalize well.

The code for Gradient Boosting is:

from ucimlrepo import fetch_ucirepo
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import root_mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the dataset (UCI repository id 291) and split off a 20% test set
airfoil_self_noise = fetch_ucirepo(id=291)
X = airfoil_self_noise.data.features
y = airfoil_self_noise.data.targets
randomstate = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=randomstate)

GBR = GradientBoostingRegressor(random_state=randomstate)
# greater_is_better=False makes the scorer return the negated RMSE
rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [.001, .005, .01],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10],
    'subsample': [0.5, 0.8, 1.0]
}
# Grid search with 10-fold cross-validation
GBR2 = GridSearchCV(estimator=GBR, param_grid=param_grid, cv=10, scoring=rmse_scorer, n_jobs=4)
GBR2.fit(X_train, y_train.values.ravel())
print(GBR2.best_params_)
# best_score_ is the negated RMSE, so take the absolute value
print(abs(GBR2.best_score_))

# Compare the refitted best model's error on the training and test sets
y_pred_train = GBR2.predict(X_train)
y_pred_test = GBR2.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_pred_test)
train_rmse = root_mean_squared_error(y_train, y_pred_train)
print(f"Test RMSE: {test_rmse}")
print(f"Train RMSE: {train_rmse}")
print(f"Difference: {test_rmse - train_rmse}")

For Gradient Boosting, I noticed that manually lowering the learning rate reduces the difference to a size where it no longer indicates overfitting, but the RMSE for both training and test data becomes much higher.

I also tried adding early stopping and L1 and L2 regularization, but that didn't change the results much.

Upvotes: -3

Views: 124

Answers (1)

Ricardo Vargas

Reputation: 1

As you know, RMSE on the training set alone is not sufficient to judge a model. RandomForestRegressor can reduce overfitting through cost-complexity pruning: as the documentation states, setting ccp_alpha > 0 (the default value is 0.0) prunes each tree. After pruning, the fit on the training set should get worse (the training RMSE goes up), but the test RMSE is expected to improve, or at least the gap between the two to narrow, as the model generalizes better. GradientBoostingRegressor accepts the same ccp_alpha parameter for pruning.
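A minimal sketch of tuning ccp_alpha with GridSearchCV while also recording the training-fold scores, so the gap between training and cross-validation error is visible for each value. To keep it self-contained it uses synthetic data from make_friedman1 as a stand-in for the airfoil dataset, and the ccp_alpha values are illustrative guesses, not recommendations; a sensible range depends on the scale of the target.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic 5-feature data stands in for the airfoil dataset.
X, y = make_friedman1(n_samples=600, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative ccp_alpha values; tune the range for your own target scale.
param_grid = {"ccp_alpha": [0.0, 0.01, 0.05, 0.1]}

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,  # also record the score on the training folds
)
search.fit(X_train, y_train)

# Scores are negated RMSE, so flip the sign to read them as errors.
train_rmse = -search.cv_results_["mean_train_score"]
val_rmse = -search.cv_results_["mean_test_score"]
for alpha, tr, va in zip(param_grid["ccp_alpha"], train_rmse, val_rmse):
    print(f"ccp_alpha={alpha}: train RMSE={tr:.3f}, CV RMSE={va:.3f}, gap={va - tr:.3f}")

# Held-out check of the refitted best model.
test_rmse = np.sqrt(np.mean((y_test - search.predict(X_test)) ** 2))
print("Best parameters:", search.best_params_)
print(f"Test RMSE: {test_rmse:.3f}")
```

Comparing the training and cross-validation columns side by side tends to be more informative than the single train/test difference, because it shows whether a narrower gap comes from better generalization or only from a worse training fit.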

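Another way to improve generalization, for gradient boosting in particular, is to decrease the number of estimators (n_estimators). For a random forest, adding more trees does not usually increase overfitting, so limiting max_depth or raising min_samples_leaf tends to have more effect there.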
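For gradient boosting, one way to arrive at a smaller number of estimators without guessing is the built-in early stopping of GradientBoostingRegressor. The sketch below assumes synthetic data from make_friedman1 in place of the airfoil dataset, and the learning rate, subsample and patience values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Assumption: synthetic 5-feature data stands in for the airfoil dataset.
X, y = make_friedman1(n_samples=600, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gbr = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of boosting stages
    learning_rate=0.05,       # illustrative value
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 stages without improvement
    random_state=0,
)
gbr.fit(X_train, y_train)

# n_estimators_ is the number of stages actually fitted before stopping.
print("Boosting stages fitted:", gbr.n_estimators_)

train_rmse = np.sqrt(np.mean((y_train - gbr.predict(X_train)) ** 2))
test_rmse = np.sqrt(np.mean((y_test - gbr.predict(X_test)) ** 2))
print(f"Train RMSE: {train_rmse:.3f}")
print(f"Test RMSE: {test_rmse:.3f}")
print(f"Difference: {test_rmse - train_rmse:.3f}")
```

Note that with a very small learning rate the validation score may keep improving slowly for all stages, in which case early stopping never triggers and n_estimators_ equals the upper bound.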

Upvotes: 0
