Reputation: 243
I am a novice in statistical methods, so please excuse any naivety. I am having trouble understanding how cross validation behaves when using decision tree regression from sklearn (e.g. DecisionTreeRegressor and RandomForestRegressor). My dataset varies from having multiple predictors (y = a single dependent variable; X = multiple independent variables) to having a single predictor, and it consists of enough cases (> 10k). The following explanation applies to all cases.
When fitting and scoring the regressors with the standard methods:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
dt_score = dt.score(X, y)  # R squared on the same data used for fitting
rf_score = rf.score(X, y)
The dt_score and rf_score return promising R-squared values (> 0.7); however, I am aware of the over-fitting properties of the DT and, to a lesser extent, of the RF. Therefore I tried to score the regressors with cross validation (10-fold) to get a truer representation of the accuracy:
from sklearn.model_selection import cross_val_score

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
# note: cross_val_score clones the estimator and refits it on every fold,
# so the two fits above are not reused here
dt_scores = cross_val_score(dt, X, y, cv=10)  # 10 R squared values, one per fold
rf_scores = cross_val_score(rf, X, y, cv=10)
dt_score = round(sum(dt_scores) / len(dt_scores), 3)
rf_score = round(sum(rf_scores) / len(rf_scores), 3)
The results of this cross validation always return negative values. I assume they are R-squared values, per the sklearn guidelines: "By default, the score computed at each CV iteration is the score method of the estimator" (the score method of both regressors is R squared). The explanation the guidelines give for basic KFold cross validation is: "Each fold is then used once as a validation while the k - 1 remaining folds form the training set."
How I understand this, when using 10-fold CV, is: my dataset is split into 10 equal parts; for each part, the remaining 9 parts are used for training (I am not sure if this is a fit operation or a score operation) and that part is used for validation (I am not sure what is done for validation). These regressors are a complete "black box" to me, so I have no idea how a tree is used for regression and where the cross validation gets its R-squared values from.
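If I understand this correctly, cross_val_score must internally be doing something like the following manual loop (a rough sketch of my understanding, assuming X and y are NumPy arrays; fold_scores is just my own name for the result):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)  # what an integer cv=10 expands to for regressors
fold_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(dt)  # a fresh, unfitted copy of the estimator
    fold_model.fit(X[train_idx], y[train_idx])  # "training": fit on the 9 remaining folds
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # "validation": R squared on the held-out fold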
So to summarize, I am struggling to understand how cross validation can decrease the accuracy (R squared) so dramatically. Am I using cross validation correctly for a regressor? Does it make sense to use cross validation for a decision tree regressor? Should I be using another cross-validation method?
Thank you
Upvotes: 5
Views: 21715
Reputation: 1411
I have put together a small code snippet illustrating how to use DecisionTreeRegressor with cross validation.
A. In the first snippet, cross_val_score is used. Note that the R-squared score can be negative, which tells you the model is learning poorly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=0)

# in newer sklearn versions the "mae" criterion is spelled "absolute_error"
dt = DecisionTreeRegressor(random_state=0, criterion="mae")
dt_fit = dt.fit(X_train, y_train)

dt_scores = cross_val_score(dt_fit, X_train, y_train, cv=5)
print("mean cross validation score: {}".format(np.mean(dt_scores)))
print("score without cv: {}".format(dt_fit.score(X_train, y_train)))

# on the test or hold-out set
from sklearn.metrics import r2_score
print(r2_score(y_test, dt_fit.predict(X_test)))
print(dt_fit.score(X_test, y_test))
B. In the next snippet, cross validation is used to perform a GridSearch over the parameter 'min_samples_split', and the best estimator is then used for scoring on the validation/holdout set.

# Using GridSearch:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
scoring = make_scorer(r2_score)
g_cv = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'min_samples_split': range(2, 10)},
                    scoring=scoring, cv=5, refit=True)

g_cv.fit(X_train, y_train)
print(g_cv.best_params_)  # best value of min_samples_split found by the search

result = g_cv.cv_results_
# print(result)

print(r2_score(y_test, g_cv.best_estimator_.predict(X_test)))
Hoping this was useful.
https://www.programcreek.com/python/example/75177/sklearn.cross_validation.cross_val_score
Upvotes: 5
Reputation: 10399
The decision tree splits on values of your features that generate a group with the highest purity. By purity I mean that all the members in that group share everything, or almost everything, that is similar (e.g. all white, age 35, all male, etc.). It will keep doing this until all your leaf nodes are perfectly pure, or until certain stopping criteria are met (e.g. the minimum number of samples a node must contain in order to be split). The parameters you'll see in the sklearn documentation are basically those stopping parameters. Now, in terms of regression, what the tree will do is take the average of all the true y values in each leaf (a node that doesn't have any more splits) as the estimated y-hat for that particular path. When you predict on your test dataset, each record will follow some path down the tree until it hits a leaf node, and the estimated y-hat for that record will be the average true y of all training observations in that leaf node.
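You can check this leaf-averaging behaviour yourself with a toy example (a minimal sketch; the data is made up purely for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])  # made-up feature
y = np.array([1.0, 1.2, 0.8, 9.0, 11.0, 10.0])               # made-up target

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
leaves = tree.apply(X)  # which leaf each training sample falls into
for leaf in np.unique(leaves):
    print(leaf, y[leaves == leaf].mean())  # mean of the true y per leaf...
print(tree.predict([[2.5], [10.5]]))       # ...matches the tree's predictions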
A random forest is basically a collection of decision trees, each trained on a random subset of your training data. These trees are usually not as deep as a single decision tree model, which helps alleviate the overfitting symptoms of a single decision tree. The idea of an RF is that you're combining many weak learners that together generalize your data well. Hence, less overfitting.
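This is also easy to verify: for a regression forest, the forest's prediction is exactly the average of its individual trees' predictions (a small sketch on made-up data):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)                         # made-up feature
y = X.ravel() + np.random.RandomState(0).normal(scale=0.5, size=20)   # made-up target

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
per_tree = np.stack([t.predict(X) for t in rf.estimators_])  # one row of predictions per tree
print(np.allclose(per_tree.mean(axis=0), rf.predict(X)))     # True: the forest averages its trees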
The R-squared metric is basically 1 - (SS_res / SS_tot). Breaking that formula down, you're looking at the sum of squared residuals (SS_res) and the total sum of squares (SS_tot). Therefore you just have to know the true y values, the estimated y-hat values, and the mean of the true y values, y-bar.
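Computed by hand that looks like this (a quick sketch; the arrays are made up, and sklearn's r2_score is only there to confirm the arithmetic):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # made-up true y
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # made-up estimated y-hat

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around y-bar
print(1 - ss_res / ss_tot)       # manual R squared
print(r2_score(y_true, y_pred))  # matches sklearn

A negative R squared simply means SS_res > SS_tot, i.e. the model's held-out predictions are worse than always predicting y-bar, which is exactly what your cross validation is reporting.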
Upvotes: 0