Reputation: 806
Could someone explain how the Cover column in the xgboost R package is calculated in the xgb.model.dt.tree function?
In the documentation it says that Cover "is a metric to measure the number of observations affected by the split".
When you run the following code, given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.
data(agaricus.train, package = 'xgboost')
# Both datasets are lists with two items: a sparse matrix and the labels
# (labels = the outcome column which will be learned).
# Each column of the sparse matrix is a feature in one-hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
# agaricus.train$data@Dimnames[[2]] holds the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of this number (1628.25)?
Also, Cover for node 1 of tree 1 is 788.852 - how is this number calculated?
Any help would be much appreciated. Thanks.
Upvotes: 14
Views: 8258
Reputation: 4834
Cover is defined in xgboost as:
the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be
https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd Not particularly well documented....
In order to calculate the cover of a node, we need to know the model's predictions for the rows that reach that point in the tree, and the second derivative (hessian) of the loss function with respect to those predictions.
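In symbols (just restating the quoted definition, with h_i the second derivative of the loss for observation i at its current prediction): Cover(node) = sum of h_i over all observations i that fall into that node.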
Lucky for us, the prediction for every data point (6513 of them) in the 0-0 node in your example is .5. This is a global default setting whereby your first prediction at t=0 is .5.
base_score [ default=0.5 ] the initial prediction score of all instances, global bias
http://xgboost.readthedocs.org/en/latest/parameter.html
The gradient of the binary logistic loss (which is your objective function) is p - y, where p = your prediction and y = the true label.
Thus, the hessian (which we need for this) is p*(1-p). Note: the hessian can be determined without y, the true labels.
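As a minimal sketch (these two helpers are purely illustrative, not part of the xgboost API), the gradient and hessian with respect to the raw score can be written as:
# Illustrative only, not xgboost's internal code:
# gradient and hessian of the binary logistic loss w.r.t. the raw score,
# expressed through the predicted probability p
grad <- function(p, y) p - y        # first derivative; needs the label y
hess <- function(p)    p * (1 - p)  # second derivative; no label needed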
So (bringing it home):
6513 * (.5) * (1 - .5) = 1628.25
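You can reproduce that number with the illustrative hess() helper above (all 6513 rows sit in the root node, and each one starts at the base_score prediction of 0.5):
sum(hess(rep(0.5, 6513)))  # 1628.25 -- the Cover of node 0 of tree 0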
In the second tree, the predictions at that point are no longer all .5, so let's get the predictions after one tree:
p = predict(bst, newdata = train$data, ntree = 1)
head(p)
[1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077
sum(p*(1-p)) # sum of the hessians in that node (the root node holds all the data)
[1] 788.8521
Note: for linear (squared error) regression the hessian is always one, so the cover simply indicates how many examples are in that leaf.
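As a quick sanity check of that claim (a sketch, assuming the "reg:squarederror" objective name used by current xgboost releases; older releases called it "reg:linear"), the root Cover should come out as the number of training rows:
# Sketch: with squared-error loss every row's hessian is 1, so the Cover of
# the root node (node 0 of tree 0) should equal the number of rows, 6513.
bst_reg <- xgboost(data = train$data, label = train$label, max.depth = 2,
                   eta = 1, nthread = 2, nround = 2,
                   objective = "reg:squarederror")
xgb.model.dt.tree(train$data@Dimnames[[2]], model = bst_reg)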
The big takeaway is that cover is defined by the hessian of the objective function. There is plenty of information out there on deriving the gradient and hessian of the binary logistic function.
These slides are helpful in seeing why he uses hessians as a weighting, and they also explain how xgboost splits differently from standard trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Upvotes: 27