How are the leaf values of XGBoost regression trees related to the prediction?

Question

It seems that the sum of the corresponding leaf values across the trees doesn't equal the prediction. Here is some sample code:

import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
# Fit 100 depth-1 trees (stumps) to y = 2x
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')

The tree dump reads:

model.get_booster().get_dump()[:2]

['0:[x<0] yes=1,no=2,missing=1
	1:leaf=-2.90277791
	2:leaf=2.65277767
',
 '0:[x<2.22222233] yes=1,no=2,missing=1
	1:leaf=-1.90595233
	2:leaf=2.44333339
']

If I use only the first tree for prediction:

# Newer XGBoost releases replace ntree_limit with iteration_range=(0, 1)
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(Xtest['x'], Ytest2, '.')
np.unique(Ytest2)  # array([-2.4028,  3.1528], dtype=float32)

Clearly, the unique values of Ytest2 do not correspond to the leaf values of the first tree, which are -2.90277791 and 2.65277767, although the observed split point is right at 0.

How are the leaf values related to the predictions?
Why are the leaf values in the first tree not symmetric, given that the input is symmetric?

Ben Reiniger · Accepted Answer

Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5 here (newer XGBoost releases may instead estimate it from the training data when it is left unset). And indeed, -2.902777 + 0.5 ≈ -2.4028 and 2.652777 + 0.5 ≈ 3.1528.
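As a quick check of that arithmetic, here is a minimal sketch that evaluates the first tree by hand and adds the initial prediction. It uses only the numbers from the dump in the question; the helper name predict_one_tree is made up for illustration and is not part of xgboost.

```python
import numpy as np

# Values copied from the first tree in the question's dump.
BASE_SCORE = 0.5           # initial prediction assumed in this answer
SPLIT = 0.0                # root node: [x<0]
LEAF_LEFT = -2.90277791    # leaf 1, reached when x < 0
LEAF_RIGHT = 2.65277767    # leaf 2, reached otherwise


def predict_one_tree(x):
    """Hypothetical helper: base_score plus the leaf the sample falls into."""
    x = np.asarray(x, dtype=float)
    return BASE_SCORE + np.where(x < SPLIT, LEAF_LEFT, LEAF_RIGHT)


x_test = np.linspace(-20, 20, 101)
one_tree_pred = predict_one_tree(x_test)
# Only two distinct predictions exist, one per leaf.
print(np.unique(np.round(one_tree_pred, 4)))
```

The two distinct values round to -2.4028 and 3.1528, matching np.unique(Ytest2) in the question. With more trees, the prediction is base_score plus the sum of one leaf value per tree.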

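That also explains your second question: the targets are symmetric, but their differences from that initial prediction are not. If you set learning_rate=1 you could probably get the predictions to be symmetric after one round (strictly, this also needs reg_lambda=0, since the default L2 penalty shrinks each leaf toward zero), or you could just set base_score=0.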
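To see where the asymmetric leaf values come from, here is a rough sketch that recomputes the first tree's leaves by hand. It assumes the squared-error objective, the split at x < 0 taken from the dump, and the default learning_rate=0.3 and reg_lambda=1; the helper first_tree_leaves is hypothetical, not an xgboost function.

```python
import numpy as np

# Training data from the question.
x = np.linspace(-10, 10, 10)
y = 2 * x


def first_tree_leaves(base_score, learning_rate=0.3, reg_lambda=1.0):
    """Hypothetical helper: leaf values of a depth-1 tree split at x < 0.

    Assumes squared-error loss, so each sample has gradient (pred - y) and
    hessian 1, and a leaf's weight is -sum(grad) / (count + reg_lambda),
    scaled by the learning rate.
    """
    grad = base_score - y
    left, right = x < 0, x >= 0
    w_left = -grad[left].sum() / (left.sum() + reg_lambda)
    w_right = -grad[right].sum() / (right.sum() + reg_lambda)
    return learning_rate * w_left, learning_rate * w_right


default_leaves = first_tree_leaves(base_score=0.5)
zero_base_leaves = first_tree_leaves(base_score=0.0)
unshrunk_leaves = first_tree_leaves(base_score=0.5, learning_rate=1.0, reg_lambda=0.0)
print(default_leaves)     # compare with the dump: -2.90277791, 2.65277767
print(zero_base_leaves)   # equal magnitude, opposite sign
print(0.5 + unshrunk_leaves[0], 0.5 + unshrunk_leaves[1])  # one-round predictions
```

Under these assumptions, base_score=0.5 reproduces the dumped leaf values, base_score=0 gives leaves of equal magnitude and opposite sign, and learning_rate=1 with reg_lambda=0 gives symmetric predictions (not leaves) after one round.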
