Reputation: 1
I have trouble understanding how XGBoost calculates the leaf weights in multi-class classification.
I generated a simple example with two features and 3 classes.
The training data looks like this:
| sample | feature 0 | feature 1 | label y |
|---|---|---|---|
| x1 | 1 | 1 | 0 |
| x2 | 2 | 3 | 1 |
| x3 | 3 | 1 | 2 |
From what I have read so far, XGBoost uses a "one vs. all" principle. Therefore, XGBoost creates separate trees for each class. With a setting of 2 boosting rounds, XGBoost creates 6 trees. The first tree corresponds to class 0, the second to class 1, the third to class 2, the fourth to class 0 and so on ...
I used the following Python code to plot the first three trees:
from xgboost import XGBClassifier, plot_tree
import numpy as np

X_train = np.array([[1, 1], [2, 3], [3, 1]])
y_train = np.array([0, 1, 2])

param = {
    'max_depth': 2,
    'eta': 0.3,
    'objective': 'multi:softprob',
    'num_class': 3,
    'min_child_weight': 0,
    'n_estimators': 2,
    'base_score': 0.5
}

clf = XGBClassifier(**param)
clf.fit(X_train, y_train)

# trees are stored round by round, so trees 0, 1, 2 belong to classes 0, 1, 2
plot_tree(clf, num_trees=0)
plot_tree(clf, num_trees=1)
plot_tree(clf, num_trees=2)
The result for the first tree looks like this: [plot of the first tree booster: one split on f0 < 1.5, leaf values approximately 0.1384 and -0.106]
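To read the exact leaf values instead of estimating them from the plots, the fitted boosters can also be dumped as text. A minimal sketch using the standard Booster.get_dump method (with_stats=True additionally reports gain and cover):

booster = clf.get_booster()
# one text dump per tree; with num_class=3 the trees alternate over classes 0, 1, 2
for i, tree in enumerate(booster.get_dump(with_stats=True)):
    print(f"tree {i} (class {i % 3}):")
    print(tree)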
My question is: how are the weights of the leaves calculated?
My thoughts so far:
As I understand it, XGBoost uses the log loss function for binary classification.
L(p, y) = -( y log(p) + (1 - y) log(1 - p) )
(where p = 1/(1 + exp(-x)) is the sigmoid of the raw prediction score x)
y is the true label and p is the model's predicted probability that the sample belongs to class 1.
The gradient and hessian of this loss function are:
gradient: p - y
hessian: p(1-p)
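As a quick numeric check of these formulas, here is a minimal sketch (plain Python, not XGBoost internals) that evaluates the gradient and hessian of the binary log loss at a given raw score:

import math

def binary_logloss_grad_hess(raw_score, y):
    # textbook gradient/hessian of the binary log loss w.r.t. the raw score
    p = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid
    return p - y, p * (1.0 - p)

# a raw score of 0 gives p = 0.5, so grad = 0.5 - y and hess = 0.25
print(binary_logloss_grad_hess(0.0, 1))   # (-0.5, 0.25)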
There is a nice explanation of how the ideal weight is calculated for a fixed tree structure (XGBoost Docs). Accordingly, the ideal weight is calculated by:
w_t = - G_t / (H_t + lambda)
where G_t is the sum of gradient statistics of all samples belonging to the leaf. Similarly, H_t is the sum of second-order statistics (the hessian) of all samples belonging to the leaf.
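Expressed as code, this is a small helper (a sketch; it assumes the learning rate eta is applied as a shrinkage factor on the leaf output):

def leaf_weight(grads, hessians, lam=1.0, eta=0.3):
    # optimal leaf weight -G/(H + lambda), shrunk by the learning rate eta
    G = sum(grads)
    H = sum(hessians)
    return -G / (H + lam) * eta

Plugging the gradient and hessian values computed below into this helper reproduces my hand-calculated 0.12 and -0.2.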
Now I would like to calculate the weights of the first tree. The initial prediction score of all instances should be 0.5. Since this tree corresponds to class 0, the true labels should be y1=1, y2=0, y3=0 (since only sample 1 has class 0).
The gradient and hessian statistics are:
g1 = 0.5 - 1 = -0.5
g2 = 0.5 - 0 = 0.5
g3 = 0.5 - 0 = 0.5
h1 = 0.5(1-0.5) = 0.25
h2 = 0.5(1-0.5) = 0.25
h3 = 0.5(1-0.5) = 0.25
The first tree has only one split criterion f0 < 1.5. Therefore sample 1 is assigned to the left leaf and samples 2 and 3 are assigned to the right leaf.
The weights should be calculated by:
w1 = -g1 / (h1 + lambda) * eta = 0.5 / (0.25 + 1) * 0.3 = 0.12
w2 = -(g2 + g3) / (h2 + h3 + lambda) * eta = -(0.5 + 0.5) / (0.25 + 0.25 + 1) * 0.3 = -0.2
where eta is the learning rate (0.3) and lambda is the L2 regularization term (default 1).
My results do not match XGBoost's output: according to the plotted tree, w1 and w2 should be approximately 0.1384 and -0.106.
What am I doing wrong?
I think my real problem is that I do not understand the log loss and how the sigmoid function influences the result.
I would be very grateful if someone can help me with this.
Upvotes: 0
Views: 1067
Reputation: 1
I found my mistake; actually, there are two mistakes:

First, the objective is multi:softprob, not binary logistic regression, so the hessian XGBoost uses for each class is

2*p(1-p)

instead of p(1-p).

Second, the initial probability is not 0.5. With a base score of 0.5 for every class, the softmax over the 3 classes gives

exp(0.5) / (3 * exp(0.5)) = 1/3 ≈ 0.33

With these corrections, the statistics for the first tree (class 0) become:

g1 = 0.33 - 1 = -0.67
g2 = g3 = 0.33 - 0 = 0.33
h1 = h2 = h3 = 2 * 0.33 * (1 - 0.33) = 0.4422

and the leaf weights are

w1 = -(-0.67) / (0.4422 + 1) * 0.3 = 0.139
w2 = -(0.33 + 0.33) / (0.4422 + 0.4422 + 1) * 0.3 = -0.105

which matches the values in the plotted tree.
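For completeness, a small NumPy sketch of this corrected calculation for the first tree (it assumes the two fixes above: the softmax of the base scores as the initial probability, and 2*p*(1-p) as the hessian):

import numpy as np

base_score = 0.5
p = np.exp(base_score) / (3 * np.exp(base_score))   # softmax over 3 equal base scores = 1/3

y = np.array([1.0, 0.0, 0.0])        # class-0 indicator for samples x1, x2, x3
g = p - y                            # gradients
h = np.full(3, 2 * p * (1 - p))      # hessians of the softmax objective

lam, eta = 1.0, 0.3
w_left = -g[0] / (h[0] + lam) * eta                       # leaf containing sample 1
w_right = -(g[1] + g[2]) / (h[1] + h[2] + lam) * eta      # leaf containing samples 2 and 3
print(w_left, w_right)               # ~0.138 and ~-0.106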
Upvotes: 0