Heng

Reputation: 41

Random forest getting mse by tuning two hyperparameters using a for loop

I'm developing a model to predict a target variable using RandomForestRegressor from scikit-learn.

I have written a function to compute the MSE, shown below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def get_mse(n_estimators, max_leaf_nodes, X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=n_estimators, max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    preds_val = model.predict(X_valid)
    # squared=False makes this return the root of the MSE (RMSE)
    mse = mean_squared_error(y_valid, preds_val, squared=False)
    return mse

I would like to use a for loop to find the best MSE by combining lists of values for n_estimators and max_leaf_nodes.

Below is the code that I wrote:

n_estimators = [100,150,200,250]
max_leaf_nodes = [10, 50, 100, 200]

for n_estimators,max_leaf_nodes in zip(n_estimators,max_leaf_nodes):
    my_mse = get_mse(n_estimators,max_leaf_nodes, X_train, X_valid, y_train, y_valid)
    print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %d" %(n_estimators, max_leaf_nodes, my_mse))

But when I run this for loop, it always returns an MSE of 0 for each combination of the two hyperparameters.

I have tested my function with the following call, and it returns the correct MSE:

get_mse(200, 100, X_train, X_valid, y_train, y_valid)

I'm wondering why my for loop is not working properly and always returns an MSE of 0.

Could someone help me solve this issue?

Thank you

Upvotes: 1

Views: 544

Answers (1)

afsharov

Reputation: 5164

There are mainly two things to consider:

First, do not shadow the names you already used to declare the list of values (n_estimators and max_leaf_nodes). Instead, make them clearly distinguishable:

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

for n_estimators, max_leaf_nodes in zip(n_estimators_list, max_leaf_nodes_list):
...
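To illustrate why the shadowing matters, here is a minimal sketch using the list values from the question. The names after_loop and rerun_error are only there for illustration. Once the loop has run, the original names no longer refer to the lists, so running the same loop a second time (for example by re-executing a notebook cell) fails:

```python
n_estimators = [100, 150, 200, 250]
max_leaf_nodes = [10, 50, 100, 200]

# The loop variables reuse the list names, so the lists get overwritten.
for n_estimators, max_leaf_nodes in zip(n_estimators, max_leaf_nodes):
    pass

# Both names now hold the last pair of values instead of the lists.
after_loop = (n_estimators, max_leaf_nodes)
print(after_loop)

# Running the same zip again fails because integers are not iterable.
try:
    zip(n_estimators, max_leaf_nodes)
    rerun_error = None
except TypeError as exc:
    rerun_error = exc

print(type(rerun_error).__name__)
```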

Secondly, as pointed out in the comments above, you should replace the %d formatter for mse with %f. The %d formatter truncates a float to its integer part, so any value between 0 and 1 is printed as 0:

print("N_estimators: %d  \t\t Max leaf nodes: %d  \t\t Mean Squared Error:  %f" %(n_estimators, max_leaf_nodes, my_mse))

Personally, I would recommend using one of the newer string formatting options, for example Python 3's f-strings, to avoid such mishaps:

print(f"N_estimators: {n_estimators}  \t\t Max leaf nodes: {max_leaf_nodes}  \t\t Mean Squared Error:  {my_mse}")
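The difference between the formatters can be checked without fitting any model. This is a small sketch in which rmse is a made-up value, not a result from the question's data:

```python
rmse = 0.4731  # hypothetical error value between 0 and 1

as_int = "%d" % rmse        # %d keeps only the integer part
as_float = "%f" % rmse      # %f prints six decimal places by default
as_fstring = f"{rmse:.4f}"  # f-string with an explicit precision

print(as_int)
print(as_float)
print(as_fstring)
```

The .4f format specification is optional; a plain {rmse} prints the float as is.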

A last note, also already mentioned in the comments: for hyperparameter tuning, you could use GridSearchCV, a ready-made scikit-learn utility that finds the best hyperparameters through an exhaustive search over a predefined grid. Example usage:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_leaf_nodes': [10, 50, 100, 200]
}

gs = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error'
)

gs.fit(X, y)
print(gs.best_params_)

The advantage is that this implementation is battle-tested, provides many readily available values and statistics to inspect the result, and uses cross-validation. Furthermore, it explores all 16 possible hyperparameter combinations, whereas your own loop does not: zip pairs the two lists element by element, so it only tries the 4 combinations (100, 10), (150, 50), (200, 100) and (250, 200).

You can read more about GridSearchCV in its documentation.
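If you would rather keep a hand-written loop, one way to cover the whole grid is itertools.product instead of zip. The sketch below is only an illustration: best_combination and its score_fn argument are hypothetical names, and score_fn stands in for any function that returns an error for a pair of hyperparameters.

```python
from itertools import product

n_estimators_list = [100, 150, 200, 250]
max_leaf_nodes_list = [10, 50, 100, 200]

# zip pairs the lists positionally; product builds every combination.
paired = list(zip(n_estimators_list, max_leaf_nodes_list))
full_grid = list(product(n_estimators_list, max_leaf_nodes_list))

print(len(paired), len(full_grid))


def best_combination(score_fn):
    """Return the (n_estimators, max_leaf_nodes) pair with the lowest score.

    score_fn is a hypothetical callable taking the two hyperparameters
    and returning an error value (lower is better).
    """
    scores = {(n, m): score_fn(n, m) for n, m in full_grid}
    return min(scores, key=scores.get)
```

With the function from the question, score_fn could be lambda n, m: get_mse(n, m, X_train, X_valid, y_train, y_valid). Unlike GridSearchCV, this scores each combination on a single validation split rather than with cross-validation.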

Upvotes: 2
