Kaiyi Zou

Reputation: 25

XGBoost decision tree selection

I have a question about which decision tree I should choose from an XGBoost model.

I will use the following code as an example.

# import packages
import xgboost as xgb
import matplotlib.pyplot as plt

# create DMatrix (X and y are a predefined feature matrix and target vector)
df_dmatrix = xgb.DMatrix(data = X, label = y)

# set up parameter dictionary
# ("reg:linear" is a deprecated alias for "reg:squarederror" in recent XGBoost versions)
params = {"objective":"reg:linear", "max_depth":2}

# train the model
xg_reg = xgb.train(params = params, dtrain = df_dmatrix, num_boost_round = 10)

# plot the tree
xgb.plot_tree(xg_reg, num_trees = n) # my question relates to here

I create 10 trees in the xg_reg model, and I can plot any one of them by setting n in the last line to the index of that tree.
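For example, setting n = 9 plots the last of the 10 trees (note that xgb.plot_tree requires the graphviz package to be installed):

xgb.plot_tree(xg_reg, num_trees = 9)
plt.show()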

My question is: how can I know which tree best explains the dataset? Is it always the last one? Or should I first decide which features I want in the tree, and then choose the tree that contains those features?

Upvotes: 2

Views: 1836

Answers (1)

Alessandro Solbiati

Reputation: 979

My question is: how can I know which tree explains the dataset best?

XGBoost is an implementation of Gradient Boosted Decision Trees (GBDT). Roughly speaking, GBDT builds a sequence of trees in which each new tree is fitted to the residuals of the ensemble built so far, correcting the errors of the previous trees. So the most refined single tree is the last one, tree n - 1 (num_trees = 9 in your example), though keep in mind that the model's prediction is the sum of all n trees, not the output of any one of them.
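As a minimal sketch of this (on a toy dataset, and assuming XGBoost >= 1.4, which introduced the iteration_range argument of predict), you can check that the training error keeps dropping as more of the boosted trees are included in the prediction:

import numpy as np
import xgboost as xgb

# toy regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

dtrain = xgb.DMatrix(data=X, label=y)
booster = xgb.train({"objective": "reg:squarederror", "max_depth": 2},
                    dtrain, num_boost_round=10)

for k in range(1, 11):
    # predict using only the first k trees of the ensemble
    pred_k = booster.predict(dtrain, iteration_range=(0, k))
    rmse = np.sqrt(np.mean((y - pred_k) ** 2))
    print(f"trees 0..{k - 1}: train RMSE = {rmse:.4f}")

Each added tree lowers the RMSE, which is exactly the sense in which every tree improves the prediction of the previous ones.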

You can read more about GBDT here.

Or should I determine which features I want to include in the tree, and then choose the tree that contains those features?

All the trees are trained on the same base features; what changes at each boosting iteration is the target, which becomes the residual of the model built so far. So you cannot pick the best tree by looking at which features it contains. In this video there is an intuitive explanation of residuals.
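As a rough check (continuing the booster from the sketch above; trees_to_dataframe requires pandas), you can list which features each tree actually splits on and see that they all draw from the same feature pool:

import pandas as pd

tree_df = booster.trees_to_dataframe()          # one row per tree node
splits = tree_df[tree_df["Feature"] != "Leaf"]  # keep only split nodes
print(splits.groupby("Tree")["Feature"].unique())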

Upvotes: 1
