Reputation: 2017
I was trying the Random Forest algorithm on the Boston dataset to predict the house prices (medv) with the help of sklearn's RandomForestRegressor. In all I tried 3 iterations, as below.
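All three iterations assume the Boston data has already been loaded and split into X_train, X_test, y_train and y_test. A minimal setup sketch along those lines (the test_size and random_state below are placeholders for illustration, not necessarily the exact values behind the reported numbers):
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Boston housing data; the target is the median house value (medv)
boston = load_boston()
X, y = boston.data, boston.target

# Hypothetical split; the original split parameters are not shown in the post
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)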
Iteration 1: Using the model with default hyperparameters
#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(random_state = 1, n_jobs = -1)
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)
#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
Results of Iteration 1
{'RMSE Test': 2.9850839211419435, 'RMSE Train': 1.2291604936401441}
Iteration 2: I used RandomizedSearchCV to get optimum values of hyper-parameters
from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1)
param_grid = {
    'max_features' : ["auto", "sqrt", "log2"],
    'min_samples_split' : np.linspace(0.1, 1.0, 10),
    'max_depth' : [x for x in range(1,20)]
}
from sklearn.model_selection import RandomizedSearchCV
CV_rfc = RandomizedSearchCV(estimator = RFReg, param_distributions = param_grid, n_jobs = -1, cv = 10, n_iter = 50)
CV_rfc.fit(X_train, y_train)
So I got the best hyperparameters as follows
CV_rfc.best_params_
#{'min_samples_split': 0.1, 'max_features': 'auto', 'max_depth': 18}
CV_rfc.best_score_
#0.8021713812777814
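As a side note, since refit=True is the default, the fitted search object already contains an estimator refit on the whole training set, so the tuned model can also be used directly (an equivalent shortcut, shown only for reference):
# RandomizedSearchCV refits the best configuration on all of X_train by default (refit=True),
# so predictions can come straight from the search object:
y_pred = CV_rfc.best_estimator_.predict(X_test)
y_pred_train = CV_rfc.best_estimator_.predict(X_train)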
So I trained a new model with the best hyperparameters as below
#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1, min_samples_split = 0.1, max_features = 'auto', max_depth = 18)
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)
#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
Results of Iteration 2
{'RMSE Test': 3.2836794902147926, 'RMSE Train': 2.71230367772569}
Iteration 3: I used GridSearchCV to get optimum values of hyper-parameters
from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1)
param_grid = {
    'max_features' : ["auto", "sqrt", "log2"],
    'min_samples_split' : np.linspace(0.1, 1.0, 10),
    'max_depth' : [x for x in range(1,20)]
}
from sklearn.model_selection import GridSearchCV
CV_rfc = GridSearchCV(estimator = RFReg, param_grid = param_grid, cv = 10, n_jobs = -1)
CV_rfc.fit(X_train, y_train)
So I got the best hyperparameters as follows
CV_rfc.best_params_
#{'max_depth': 12, 'max_features': 'auto', 'min_samples_split': 0.1}
CV_rfc.best_score_
#0.8021820114800677
Results of Iteration 3
{'RMSE Test': 3.283690568225705, 'RMSE Train': 2.712331014201783}
My function to evaluate RMSE
from sklearn.metrics import mean_squared_error

def model_evaluate(y_train, y_test, y_pred, y_pred_train):
    #RMSE Test
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    #RMSE Train
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    metrics = {
        'RMSE Test': rmse_test,
        'RMSE Train': rmse_train}
    return metrics
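The dictionaries reported for each iteration come from calling this helper, e.g.:
metrics = model_evaluate(y_train, y_test, y_pred, y_pred_train)
print(metrics)   # {'RMSE Test': ..., 'RMSE Train': ...}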
So I had the below questions after 3 iterations:
1. Why are the results of the tuned model worse than the model with default parameters even when I am using RandomSearchCV and GridSearchCV? Ideally the model should give good results when tuned with cross-validation.
2. I know that cross-validation will take place only for the combination of values present in param_grid. There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation?
3. How do I decide what range of values I should try for max_features, min_samples_split, max_depth or for that matter any hyper-parameters in a machine learning model to increase its accuracy? (So that I can at least get a better tuned model than the model with default hyper-parameters.)
Upvotes: 5
Views: 8725
Reputation: 5906
Why are the results of the tuned model worse than the model with default parameters even when I am using RandomSearchCV and GridSearchCV? Ideally the model should give good results when tuned with cross-validation.
Your second question kind of answers your first one, but I tried to reproduce your results on the Boston dataset: I got {'test_rmse': 3.987, 'train_rmse': 1.442} with default parameters, {'test_rmse': 3.98, 'train_rmse': 3.426} for 'tuned' parameters with random search, and {'test_rmse': 3.993, 'train_rmse': 3.481} with grid search. Then I used hyperopt with the following parameter space:
{'max_depth': hp.choice('max_depth', range(1, 100)),
'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}
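A minimal sketch of such a hyperopt run (the objective here, 10-fold cross-validated RMSE with n_estimators=500 reused from your code, is just one reasonable choice, not necessarily the exact setup behind the numbers below):
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

space = {'max_depth': hp.choice('max_depth', range(1, 100)),
         'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
         'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}

def objective(params):
    # hyperopt passes the sampled hyperparameter values; score them by cross-validated RMSE
    model = RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1, **params)
    mse = -cross_val_score(model, x_train, y_train,
                           scoring='neg_mean_squared_error', cv=10).mean()
    return np.sqrt(mse)

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=200, trials=trials)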
After about 200 runs the results looked like this, so I widened the space to
'min_samples_split': hp.uniform('min_samples_split', 0.01, 1)
which got me the best result of {'test_rmse':3.278, 'train_rmse':1.716}
with min_samples_split equal to 0.01. According to the documentation, when min_samples_split is given as a fraction, the minimum number of samples per split is ceil(min_samples_split * n_samples), which in our case gives np.ceil(0.1 * len(x_train)) = 34, which could be kind of big for a small dataset like this.
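A quick check of how fast that threshold grows with the fraction (reusing x_train from above):
import numpy as np

n_train = len(x_train)
for frac in (0.01, 0.1, 0.5):
    # minimum number of samples a node must contain before it can be split
    print(frac, '->', int(np.ceil(frac * n_train)))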
I know that cross-validation will take place only for the combination of values present in param_grid. There could be values which are good but not included in my param_grid. So how do I deal with this kind of situation?
How do I decide what range of values I should try for max_features, min_samples_split, max_depth or for that matter any hyper-parameters in a machine learning model to increase its accuracy? (So that I can at least get a better tuned model than the model with default hyper-parameters.)
You can't know this in advance, so you have to do research for each algorithm to see what kind of parameter spaces are usually searched (a good source for this is Kaggle, e.g. google "kaggle kernel random forest"), merge them, account for your dataset's features, and optimize over them using some kind of Bayesian optimization algorithm (there are multiple existing libraries for this), which tries to optimally select new parameter values to evaluate.
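As one concrete illustration, a sketch with scikit-optimize's BayesSearchCV (the library choice and the ranges here are just an example of such a Bayesian search, not the only option):
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.ensemble import RandomForestRegressor

# Example search space; widen or narrow it based on what similar kernels use
search_spaces = {
    'max_depth': Integer(1, 100),
    'max_features': Categorical(['auto', 'sqrt', 'log2']),
    'min_samples_split': Real(0.01, 1.0, prior='uniform'),
}

opt = BayesSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1),
    search_spaces,
    n_iter=50,
    cv=10,
    scoring='neg_mean_squared_error',
    random_state=1,
)
opt.fit(x_train, y_train)
print(opt.best_params_, opt.best_score_)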
Upvotes: 3