doraemon

Reputation: 2502

How do the leaf values of xgboost regression trees relate to the prediction?

It seems that the sum of the corresponding leaf values across the trees does not equal the prediction. Here is some sample code:

import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

# Ten training points on the line y = 2x, symmetric about 0
X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')

The tree dump reads:

model.get_booster().get_dump()[:2]

['0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
 '0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n']

If I use only one tree for the prediction:

# Newer xgboost releases replace ntree_limit=1 with iteration_range=(0, 1)
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(Xtest['x'], Ytest2, '.')
np.unique(Ytest2)  # array([-2.4028,  3.1528], dtype=float32)

Clearly, the unique values of Ytest2 do not correspond to the leaf values of the first tree, which are -2.90277791 and 2.65277767, although the observed split point is right at 0.

  1. How are the leaf values related to the predictions?
  2. Why are the leaf values in the first tree not symmetric, provided that the input is symmetric?

Upvotes: 1

Views: 997

Answers (1)

Ben Reiniger

Reputation: 12602

Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5. The prediction is that initial value plus the sum of the leaf values, and indeed, -2.902777 + 0.5 ~= -2.4028 and 2.652777 + 0.5 ~= 3.1528.
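
To make the arithmetic concrete, here is a minimal sketch that rebuilds predictions from the two dumped trees in the question. It assumes depth-1 trees on the single feature x, the base_score of 0.5 mentioned above, and a squared-error objective, so that the summed margin is the prediction itself. stump_leaf and predict_from_dump are made-up helper names, not part of xgboost.

```python
import re

# First two trees, copied verbatim from the dump in the question.
dump = [
    '0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
    '0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n',
]

def stump_leaf(tree_text, x):
    """Return the leaf value that a depth-1 tree (one split on x) assigns to x."""
    split = re.search(r'\[x<([^\]]+)\] yes=(\d+),no=(\d+)', tree_text)
    threshold = float(split.group(1))
    yes_id, no_id = split.group(2), split.group(3)
    # Map node id -> leaf value, e.g. {'1': '-2.90277791', '2': '2.65277767'}
    leaves = dict(re.findall(r'(\d+):leaf=(\S+)', tree_text))
    return float(leaves[yes_id if x < threshold else no_id])

def predict_from_dump(x, n_trees, base_score=0.5):
    """base_score plus the leaf values of the first n_trees trees (assumed form)."""
    return base_score + sum(stump_leaf(t, x) for t in dump[:n_trees])

print(predict_from_dump(-5.0, n_trees=1))  # about -2.4028
print(predict_from_dump(5.0, n_trees=1))   # about 3.1528
```

The two printed values match the np.unique(Ytest2) output in the question. Passing n_trees=2 adds the second tree's leaf on top in the same way.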

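That also explains your second question: the targets are symmetric about 0, but the first tree is fit to the differences from that initial prediction (y - 0.5), and those are not symmetric. If you set learning_rate=1 you probably could get the predictions to be symmetric after one round (you may also need reg_lambda=0, since the default L2 penalty shrinks the leaf values), or you could just set base_score=0.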
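
The asymmetry can be reproduced by hand. The sketch below assumes the squared-error objective with xgboost's default learning_rate=0.3 and reg_lambda=1 and no other regularization, and it takes the split at x<0 from the dump rather than searching for it. Under those assumptions each leaf is learning_rate * sum(residuals) / (count + reg_lambda), where the residuals are y - base_score. first_tree_leaves is a hypothetical helper, not an xgboost function.

```python
def first_tree_leaves(base_score, learning_rate=0.3, reg_lambda=1.0):
    """Leaf values of the first stump for the question's data, split fixed at x < 0."""
    # Same points as np.linspace(-10, 10, 10), with y = 2x
    xs = [-10 + 20 * i / 9 for i in range(10)]
    ys = [2 * x for x in xs]

    # The first tree is fit to the residuals from the initial prediction
    left = [y - base_score for x, y in zip(xs, ys) if x < 0]
    right = [y - base_score for x, y in zip(xs, ys) if x >= 0]

    def leaf(residuals):
        # Shrunken, L2-regularized mean of the residuals in the leaf
        return learning_rate * sum(residuals) / (len(residuals) + reg_lambda)

    return leaf(left), leaf(right)

print(first_tree_leaves(base_score=0.5))  # about (-2.9028, 2.6528), as in the dump
print(first_tree_leaves(base_score=0.0))  # about (-2.7778, 2.7778), symmetric
```

With base_score=0.5 this reproduces the dumped leaf values of the first tree, and with base_score=0 the two leaves come out as mirror images.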

Upvotes: 2
