Random forest getting mse by tuning two hyperparameters using a for loop

Question

I'm developing a model to predict the target variable using the RandomForestRegressor from scikit-learn.

I have written a function to get the MSE, as shown below:

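from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    # Note: squared=False returns the root of the MSE (RMSE), not the MSE itself
    mse = mean_squared_error(y_valid, preds_val, squared=False)
    return mse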

I would like to use a for loop to find the best MSE score by combining a list of values for n_estimators with a list of values for max_leaf_nodes.

Below is the code that I wrote:

n_estimators = [100,150,200,250]
max_leaf_nodes = [10, 50, 100, 200]

for n_estimators,max_leaf_nodes in zip(n_estimators,max_leaf_nodes):
    my_mse = get_mse(n_estimators,max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("N_estimators: %d  		 Max leaf nodes: %d  		 Mean Squared Error:  %d" %(n_estimators, max_leaf_nodes, my_mse))

But when I run this for loop, it always prints an MSE of 0 for each combination of the two hyperparameters.

I have tested my function on its own with the following code, and it returns the correct MSE:

get_mse(200, 100, X_train, X_valid, y_train, y_valid)

I'm wondering why my for loop always gives me an MSE of 0.

Could someone help me solve this issue?

Thank you

afsharov · Accepted Answer

There are mainly two things to consider:

First, do not shadow the names you already used to declare the list of values (n_estimators and max_leaf_nodes). Instead, make them clearly distinguishable:

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

for n_estimators, max_leaf_nodes in zip(n_estimators_list, max_leaf_nodes_list):
    ...
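To illustrate why the shadowing matters, here is a minimal sketch that reuses the names from the question but replaces the loop body with `pass`. After the loop finishes, the names no longer refer to the lists, so running the same loop a second time (for example by re-executing a notebook cell) fails:

```python
n_estimators = [100, 150, 200, 250]
max_leaf_nodes = [10, 50, 100, 200]

# The loop variables reuse the list names, so each iteration rebinds them.
for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
    pass

# Both names now hold the last pair of values instead of the lists.
print(n_estimators, max_leaf_nodes)  # 250 200

# A second run of the same loop can no longer iterate over them.
rerun_failed = False
try:
    zip(n_estimators, max_leaf_nodes)
except TypeError:
    rerun_failed = True

print(rerun_failed)  # True
```

The first run still works because `zip` receives the lists before the loop rebinds the names. The damage only shows up afterwards.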

Second, as pointed out in the comments above, you should replace the %d formatter for the MSE with %f. The %d formatter truncates a float to an integer, so any value between 0 and 1 is printed as 0:

print("N_estimators: %d  		 Max leaf nodes: %d  		 Mean Squared Error:  %f" %(n_estimators, max_leaf_nodes, my_mse))

Personally, I would recommend using one of the newer string formatting options, for example Python 3's f-strings, to avoid such mishaps:

print(f"N_estimators: {n_estimators}  		 Max leaf nodes: {max_leaf_nodes}  		 Mean Squared Error:  {my_mse}")
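The difference between the formatters can be checked in isolation. This is a small sketch using a made-up error value (`0.7342` is purely illustrative, not a result from the question's data):

```python
error = 0.7342  # hypothetical value between 0 and 1

# %d converts the float to an integer, dropping the fractional part.
as_int = "Mean Squared Error: %d" % error

# %f keeps the fraction (six decimal places by default).
as_float = "Mean Squared Error: %f" % error

# An f-string with an explicit precision.
as_fstring = f"Mean Squared Error: {error:.4f}"

print(as_int)      # Mean Squared Error: 0
print(as_float)    # Mean Squared Error: 0.734200
print(as_fstring)  # Mean Squared Error: 0.7342
```

So the model in the question was most likely returning sensible values all along; only the printed representation was 0.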

One last note, which has also been mentioned in the comments: for hyperparameter tuning, you could use GridSearchCV, scikit-learn's built-in tool for finding the best hyperparameters with an exhaustive search over a predefined grid. Example usage:

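from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_leaf_nodes': [10, 50, 100, 200]
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error'
)

# X, y: the full feature matrix and target; cross-validation handles the splitting
gs.fit(X, y)
print(gs.best_params_)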

The advantage is that this implementation is battle-tested, provides many readily available values and statistics for inspecting the result, and uses cross-validation. Furthermore, it explores all possible hyperparameter combinations. Your own loop does not: zip pairs the two lists element by element, so it only tries (100, 10), (150, 50), (200, 100) and (250, 200), which is 4 of the 16 possible combinations.
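If you prefer to keep a hand-written loop, `itertools.product` yields every combination where `zip` only yields pairs. The sketch below is only an illustration: `toy_score` and `best_combination` are hypothetical helpers, and `toy_score` is a stand-in so the example runs without any data. In practice you would pass a function that calls `get_mse` with your train/validation split.

```python
from itertools import product

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

# zip pairs elements by position; product builds the full grid.
paired = list(zip(n_estimators_list, max_leaf_nodes_list))
full_grid = list(product(n_estimators_list, max_leaf_nodes_list))
print(len(paired), len(full_grid))  # 4 16


def toy_score(n_estimators, max_leaf_nodes):
    """Hypothetical stand-in for get_mse; lower is better."""
    return abs(n_estimators - 200) / 100 + abs(max_leaf_nodes - 100) / 100


def best_combination(score_fn, n_values, leaf_values):
    """Return (score, n_estimators, max_leaf_nodes) with the lowest score."""
    return min(
        (score_fn(n, leaves), n, leaves)
        for n, leaves in product(n_values, leaf_values)
    )


best = best_combination(toy_score, n_estimators_list, max_leaf_nodes_list)
print(best)  # (0.0, 200, 100)
```

Unlike GridSearchCV, a loop like this scores each combination on a single validation split, so the ranking can be noisier than a cross-validated one.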

You can read more about GridSearchCV in its documentation.
