Kulbear

Reputation: 901

scikit-learn GridSearchCV does not work properly with random forest

I have a grid search implementation for random forest models.

import time

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=.10, random_state=0)
# A bit of performance gain can be obtained from standardization
train_X, test_X = standarize(train_X, test_X)

tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0]
}]

scores = ['neg_mean_squared_error', 'neg_mean_absolute_error']
for n_fold in [5]:
    for score in scores:
        print("# Tuning hyper-parameters for %s with %d-fold" % (score, n_fold))
        start_time = time.time()
        print()

        # TODO: RandomForestRegressor
        clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters, cv=n_fold,
                           scoring=score, verbose=2, n_jobs=-1)
        clf.fit(train_X, train_y)
        ... Rest omitted
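
The standarize helper is not shown above; purely for context, a minimal sketch of what such a helper could look like, assuming it just wraps a StandardScaler fit on the training split:

from sklearn.preprocessing import StandardScaler

def standarize(train_X, test_X):
    # Fit the scaler on the training data only, then apply it to both splits.
    scaler = StandardScaler().fit(train_X)
    return scaler.transform(train_X), scaler.transform(test_X)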

Before using it for this grid search, I have used the exact same dataset for many other tasks, so there should not be any problem with the data. In addition, as a test I first used LinearRegression to check that the entire pipeline runs smoothly, and it works. Then I switched to RandomForestRegressor with a very small number of estimators to test it again. A very strange thing happened then; I'll attach the verbose output. There is a very significant drop in performance and I don't know what happened. There is no reason for one small grid search to take 30+ minutes.

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.0s finished
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.3s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
building tree 2 of 5
building tree 3 of 5
building tree 4 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    5.3s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.5s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total=   5.6s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5

The above log is printed within a few seconds, then things seem to get stuck starting here...

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.4min remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.5min remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.5min remaining:    0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  7.8min remaining:    0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5

These lines took more than 20 minutes to appear.

By the way, for each GridSearchCV run, linear regression takes less than 1 second.

Do you have any idea why the performance decreases that much?

Any suggestions and comments are appreciated. Thank you.

Upvotes: 0

Views: 1644

Answers (1)

Bert Kellerman

Reputation: 1629

Try setting max_depth for the RandomForestRegressor; this should reduce the fitting time. By default max_depth=None, so each tree is grown until all leaves are pure.

For example:

tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0],
    'max_depth': [4],
}]

Edit: Also, by default RandomForestRegressor has n_jobs=1, so it builds one tree at a time. Try setting n_jobs=-1.
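
For example (just a sketch; note that GridSearchCV in your code already uses n_jobs=-1, so parallelizing the forest as well can leave the two levels of parallelism competing for cores):

# Build the trees of each forest in parallel as well.
clf = GridSearchCV(RandomForestRegressor(verbose=2, n_jobs=-1),
                   tuned_parameters, cv=n_fold,
                   scoring=score, verbose=2, n_jobs=-1)
clf.fit(train_X, train_y)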

In addition, instead of looping over the scoring parameters and running GridSearchCV once per metric, you can pass multiple metrics at once. When doing so, you must also specify which metric you want GridSearchCV to select the best parameters on, via refit. Then you can access all scores in the cv_results_ dictionary after the fit.

    import numpy as np

    clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters,
                       cv=n_fold, scoring=scores, refit='neg_mean_squared_error',
                       verbose=2, n_jobs=-1)

    clf.fit(train_X, train_y)
    results = clf.cv_results_
    print(np.mean(results['mean_test_neg_mean_squared_error']))
    print(np.mean(results['mean_test_neg_mean_absolute_error']))
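
Because refit='neg_mean_squared_error' is given, after the search the best model is refit on the full training set using that metric, so you can use it directly, for example:

    print(clf.best_params_)              # parameters selected by neg_mean_squared_error
    best_rf = clf.best_estimator_        # already refit on the full training data
    test_pred = best_rf.predict(test_X)  # predictions on the held-out split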

http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py

Upvotes: 1
