Reputation: 806
Could someone explain how the Cover column in the xgboost R package is calculated in the xgb.model.dt.tree function?
In the documentation it says that Cover "is a metric to measure the number of observations affected by the split".
When you run the following code, given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.
data(agaricus.train, package = 'xgboost')
# Both datasets are lists with two items: a sparse matrix and the labels
# (labels = the outcome column which will be learned).
# Each column of the sparse matrix is a feature in one-hot encoding format.
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
# agaricus.train$data@Dimnames[[2]] holds the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of this number (1628.25)?
Also, Cover for node 1 of tree 1 is 788.852 - how is this number calculated?
Any help would be much appreciated. Thanks.
Upvotes: 14
Views: 8258
Reputation: 4834
Cover is defined in xgboost as:
the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be
https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd Not particularly well documented....
In order to calculate the cover of a node, we need to know the model's predictions for the rows that reach that point in the tree, and the second derivative (hessian) of the loss function with respect to those predictions.
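In symbols (just restating the quoted definition, with h_i the second derivative of the loss for observation i at its current prediction): Cover(node) = sum of h_i over all observations i that fall into that node.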
Lucky for us, the prediction for every data point (6513 of them) in the 0-0 node in your example is .5. This is a global default setting whereby your first prediction at t=0 is .5.
base_score [ default=0.5 ] the initial prediction score of all instances, global bias
http://xgboost.readthedocs.org/en/latest/parameter.html
The gradient of the binary logistic loss (which is your objective function) is p - y, where p = your prediction and y = the true label.
Thus, the hessian (which we need for this) is p*(1-p). Note: the hessian can be determined without y, the true labels.
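As a minimal sketch (these two helpers are purely illustrative, not part of the xgboost API), the gradient and hessian with respect to the raw score can be written as:
# Illustrative only, not xgboost's internal code:
# gradient and hessian of the binary logistic loss w.r.t. the raw score,
# expressed through the predicted probability p
grad <- function(p, y) p - y        # first derivative; needs the label y
hess <- function(p)    p * (1 - p)  # second derivative; no label needed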
So (bringing it home):
6513 * (.5) * (1 - .5) = 1628.25
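You can reproduce that number with the illustrative hess() helper above (all 6513 rows sit in the root node, and each one starts at the base_score prediction of 0.5):
sum(hess(rep(0.5, 6513)))  # 1628.25 -- the Cover of node 0 of tree 0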
In the second tree, the predictions at that point are no longer all .5, so let's get the predictions after one tree:
p = predict(bst, newdata = train$data, ntree = 1)
head(p)
[1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077
sum(p*(1-p)) # sum of the hessians in that node (the root node holds all the data)
[1] 788.8521
Note: for linear (squared error) regression the hessian is always one, so the cover simply indicates how many examples are in that leaf.
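As a quick sanity check of that claim (a sketch, assuming the "reg:squarederror" objective name used by current xgboost releases; older releases called it "reg:linear"), the root Cover should come out as the number of training rows:
# Sketch: with squared-error loss every row's hessian is 1, so the Cover of
# the root node (node 0 of tree 0) should equal the number of rows, 6513.
bst_reg <- xgboost(data = train$data, label = train$label, max.depth = 2,
                   eta = 1, nthread = 2, nround = 2,
                   objective = "reg:squarederror")
xgb.model.dt.tree(train$data@Dimnames[[2]], model = bst_reg)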
The big takeaway is that cover is defined by the hessian of the objective function. There is plenty of information out there on deriving the gradient and hessian of the binary logistic function.
These slides are helpful in seeing why he uses hessians as a weighting, and they also explain how xgboost splits differently from standard trees: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Upvotes: 27