user8270077

Reputation: 5071

Interpreting features created by xgb.create.features() function in xgboost in R

How can I interpret the features created by the xgb.create.features() in the xgboost package in R?

Here is a reproducible example:

library(xgboost)

data(mtcars)
X = as.matrix(mtcars[, -9])   # drop column 9 (am); the remaining 10 columns are the predictors
Y = mtcars$am                 # the dropped binary (0/1) column is used as the label
dtrain = xgb.DMatrix(data = X, label = Y)

model = xgb.train(data = dtrain, 
                  eval = "auc",
                  verbose =0,  maximize = TRUE, 
                  params = list(objective = "binary:logistic",
                                eta = 0.1,
                                max_depth = 6,
                                subsample = 0.8,
                                lambda = 0.1 ), 
                  nrounds = 10)

dtrain1 = xgb.create.features(model, X)
colnames(dtrain1)

'mpg' 'cyl' 'disp' 'hp' 'drat' 'wt' 'qsec' 'vs' 'gear' 'carb' 'V13' 'V14' 'V15' 'V16' 'V23' 'V24' 'V33' 'V34' 'V43' 'V44' 'V53' 'V54' 'V63' 'V64' 'V73' 'V74' 'V83' 'V84' 'V93' 'V94' 'V103' 'V104'

new_data = as.matrix(dtrain1)
new_data = data.frame(new_data)
head(new_data)

(screenshot of the head(new_data) output: the original mtcars columns followed by the binary leaf-indicator columns V13 ... V104)

Upvotes: 2

Views: 866

Answers (1)

missuse

Reputation: 19716

You fitted 10 trees. Together, these 10 trees have as many leaves as there are columns V13 - V104. Those leaves are your new variables.

Let's assume the first tree had 4 leaves and that the observation Mazda RX4 fell into the 2nd leaf; it would then be encoded 0, 1, 0, 0. The corresponding variables would be V13, V14, V15 and V16. The same applies to the second tree, and so forth.
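You can inspect those raw leaf assignments yourself. A minimal sketch, reusing model and dtrain from the question: predict() with predleaf = TRUE returns one column per tree, holding the index of the leaf each observation ends up in; these indices are what xgb.create.features one-hot encodes.

leaf_idx <- predict(model, dtrain, predleaf = TRUE)
dim(leaf_idx)    # one row per observation, one column per fitted tree (10 here)
head(leaf_idx)   # leaf index per observation and tree, before one-hot encoding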

You can tell from the variable names which columns correspond to which tree (a sketch for checking the leaf counts follows the list):
'V13' 'V14' 'V15' 'V16' - first tree
'V23' 'V24' - second tree
'V103' 'V104' - 10th tree
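One way to verify how many leaves each fitted tree has (and therefore how many V columns it contributes) is to dump the model with xgb.model.dt.tree() and count the leaf nodes per tree:

tree_dt <- xgb.model.dt.tree(model = model)
# count leaf nodes per tree; tree 0 maps to V13..., tree 1 to V23..., and so on
table(tree_dt$Tree[tree_dt$Feature == "Leaf"])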

As explained in the help for the function:

We found that boosted decision trees are a powerful and very convenient way to implement non-linear and tuple transformations of the kind we just described. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. We use 1-of-K coding of this type of features.

For example, consider the boosted tree model in Figure 1 with 2 subtrees, where the first subtree has 3 leafs and the second 2 leafs. If an instance ends up in leaf 2 in the first subtree and leaf 1 in second subtree, the overall input to the linear classifier will be the binary vector [0, 1, 0, 1, 0], where the first 3 entries correspond to the leaves of the first subtree and last 2 to those of the second subtree.
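A tiny base-R illustration of that quoted example (the leaf counts and leaf indices below are the hypothetical ones from the quote, not taken from the fitted model):

leaf_hit <- c(2, 1)   # instance lands in leaf 2 of subtree 1 and leaf 1 of subtree 2
n_leaves <- c(3, 2)   # the first subtree has 3 leaves, the second has 2
unlist(lapply(seq_along(leaf_hit), function(i) {
  v <- numeric(n_leaves[i]); v[leaf_hit[i]] <- 1; v
}))
# 0 1 0 1 0 -- the binary vector described in the quote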

Do note that this new variable set warrants another round of hyperparameter tuning and is prone to overfitting. Check the feature importance for the model before and after xgb.create.features.
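A rough sketch of that check (the second round of hyperparameter tuning is omitted; the augmented model simply reuses the parameters from the question):

dtrain_new <- xgb.DMatrix(data = dtrain1, label = Y)
model_new  <- xgb.train(data = dtrain_new,
                        params = list(objective = "binary:logistic", eta = 0.1,
                                      max_depth = 6, subsample = 0.8, lambda = 0.1),
                        nrounds = 10, verbose = 0)
xgb.importance(model = model)      # importance on the original features only
xgb.importance(model = model_new)  # importance including the V13-V104 leaf columns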

Upvotes: 2
