Reputation: 243
I am a novice in statistical methods, so please excuse any naivety. I am having trouble understanding how cross validation behaves when using decision tree regression from sklearn (e.g. DecisionTreeRegressor and RandomForestRegressor). My dataset varies from having multiple predictors (y = a single dependent variable; X = multiple independent variables) to having a single predictor, and it consists of enough cases (> 10k). The following explanation applies to all cases.
When fitting and scoring the regressors with the standard methods:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
dt_score = dt.score(X, y)  # R squared on the same data used for fitting
rf_score = rf.score(X, y)
The dt_score and rf_score return promising R-squared values (> 0.7); however, I am aware of the over-fitting properties of the DT and, to a lesser extent, of the RF. Therefore I tried to score the regressors with cross validation (10-fold) to get a truer representation of the accuracy:
from sklearn.model_selection import cross_val_score

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
# note: cross_val_score clones the estimator and refits it on every fold,
# so the two fits above are not reused here
dt_scores = cross_val_score(dt, X, y, cv=10)  # 10 R squared values, one per fold
rf_scores = cross_val_score(rf, X, y, cv=10)
dt_score = round(sum(dt_scores) / len(dt_scores), 3)
rf_score = round(sum(rf_scores) / len(rf_scores), 3)
The results of this cross validation always return negative values. I assume they are R-squared values, per the sklearn guidelines: "By default, the score computed at each CV iteration is the score method of the estimator" (the score method of both regressors is R squared). The explanation the guidelines give for basic KFold cross validation is: "Each fold is then used once as a validation while the k - 1 remaining folds form the training set."
How I understand this, when using 10-fold CV, is: my dataset is split into 10 equal parts; for each part, the remaining 9 parts are used for training (I am not sure if this is a fit operation or a score operation) and that part is used for validation (I am not sure what is done for validation). These regressors are a complete "black box" to me, so I have no idea how a tree is used for regression and where the cross validation gets its R-squared values from.
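If I understand this correctly, cross_val_score must internally be doing something like the following manual loop (a rough sketch of my understanding, assuming X and y are NumPy arrays; fold_scores is just my own name for the result):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)  # what an integer cv=10 expands to for regressors
fold_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(dt)  # a fresh, unfitted copy of the estimator
    fold_model.fit(X[train_idx], y[train_idx])  # "training": fit on the 9 remaining folds
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # "validation": R squared on the held-out fold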
So to summarize, I am struggling to understand how cross validation can decrease the accuracy (R squared) so dramatically. Am I using cross validation correctly for a regressor? Does it make sense to use cross validation for a decision tree regressor? Should I be using another cross-validation method?
Thank you
Upvotes: 5
Views: 21715
Reputation: 1411
I have put together a small code snippet illustrating how to use DecisionTreeRegressor with cross validation.
A. In the first snippet, cross_val_score is used. Note that the R-squared score can be negative, which tells you the model is learning poorly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=0)

# in newer sklearn versions the "mae" criterion is spelled "absolute_error"
dt = DecisionTreeRegressor(random_state=0, criterion="mae")
dt_fit = dt.fit(X_train, y_train)

dt_scores = cross_val_score(dt_fit, X_train, y_train, cv=5)
print("mean cross validation score: {}".format(np.mean(dt_scores)))
print("score without cv: {}".format(dt_fit.score(X_train, y_train)))

# on the test or hold-out set
from sklearn.metrics import r2_score
print(r2_score(y_test, dt_fit.predict(X_test)))
print(dt_fit.score(X_test, y_test))
B. In the next snippet, cross validation is used to perform a GridSearch over the parameter 'min_samples_split', and the best estimator is then used for scoring on the validation/holdout set.

# Using GridSearch:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
scoring = make_scorer(r2_score)
g_cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'min_samples_split': range(2, 10)},
                    scoring=scoring, cv=5, refit=True)

g_cv.fit(X_train, y_train)
print(g_cv.best_params_)  # best value of min_samples_split found by the search

result = g_cv.cv_results_
# print(result)

print(r2_score(y_test, g_cv.best_estimator_.predict(X_test)))
Hoping this was useful.
https://www.programcreek.com/python/example/75177/sklearn.cross_validation.cross_val_score
Upvotes: 5
Reputation: 10399
The decision tree splits on values of your features that generate a group with the highest purity. By purity I mean that all the members in that group share everything, or almost everything, that is similar (e.g. all white, age 35, all male, etc.). It will keep doing this until all your leaf nodes are perfectly pure, or until certain stopping criteria are met (e.g. the minimum number of samples a node must contain in order to be split). The parameters you'll see in the sklearn documentation are basically those stopping parameters. Now, in terms of regression, what the tree will do is take the average of all the true y values in each leaf (a node that doesn't have any more splits) as the estimated y-hat for that particular path. When you predict on your test dataset, each record will follow some path down the tree until it hits a leaf node, and the estimated y-hat for that record will be the average true y of all training observations in that leaf node.
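You can check this leaf-averaging behaviour yourself with a toy example (a minimal sketch; the data is made up purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])  # made-up feature
y = np.array([1.0, 1.2, 0.8, 9.0, 11.0, 10.0])               # made-up target

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
leaves = tree.apply(X)  # which leaf each training sample falls into
for leaf in np.unique(leaves):
    print(leaf, y[leaves == leaf].mean())  # mean of the true y per leaf...
print(tree.predict([[2.5], [10.5]]))       # ...matches the tree's predictions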
A random forest is basically a collection of decision trees, each trained on a random subset of your training data. These trees are usually not as deep as a single decision tree model, which helps alleviate the overfitting symptoms of a single decision tree. The idea of an RF is that you're combining many weak learners that together generalize your data well. Hence, less overfitting.
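This is also easy to verify: for a regression forest, the forest's prediction is exactly the average of its individual trees' predictions (a small sketch on made-up data):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)                         # made-up feature
y = X.ravel() + np.random.RandomState(0).normal(scale=0.5, size=20)   # made-up target

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
per_tree = np.stack([t.predict(X) for t in rf.estimators_])  # one row of predictions per tree
print(np.allclose(per_tree.mean(axis=0), rf.predict(X)))     # True: the forest averages its trees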
The R-squared metric is basically 1 - (SS_res / SS_tot). Breaking that formula down, you're looking at the sum of squared residuals (SS_res) and the total sum of squares (SS_tot). Therefore you just have to know the true y values, the estimated y-hat values, and the mean of the true y values, y-bar.
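Computed by hand that looks like this (a quick sketch; the arrays are made up, and sklearn's r2_score is only there to confirm the arithmetic):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # made-up true y
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # made-up estimated y-hat

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around y-bar
print(1 - ss_res / ss_tot)       # manual R squared
print(r2_score(y_true, y_pred))  # matches sklearn

A negative R squared simply means SS_res > SS_tot, i.e. the model's held-out predictions are worse than always predicting y-bar, which is exactly what your cross validation is reporting.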
Upvotes: 0