Reputation: 2502
It seems that the sum of the corresponding leaf values from each tree doesn't equal the prediction. Here is some sample code:
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')
The tree dump reads:
model.get_booster().get_dump()[:2]
['0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
'0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n']
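(As an aside, the same structure can be inspected as a table; this is a sketch assuming a recent xgboost where Booster.trees_to_dataframe is available, and for leaf nodes the 'Gain' column holds the leaf value:)
df = model.get_booster().trees_to_dataframe()
print(df[df['Tree'] == 0][['Node', 'Feature', 'Split', 'Gain']])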
If I use only the first tree for prediction:
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(Xtest['x'], Ytest2, '.')
np.unique(Ytest2) # array([-2.4028, 3.1528], dtype=float32)
Clearly, Ytest2's unique values do not correspond to the leaf values of the first tree, which are -2.90277791 and 2.65277767, although the observed split point is right at 0.
Upvotes: 1
Views: 997
Reputation: 12602
Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5. And indeed, -2.902777 + 0.5 ≈ -2.4028 and 2.652777 + 0.5 ≈ 3.1528.
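You can verify this directly (a minimal sketch; the leaf values are copied from the dump above):
import numpy as np

# Prediction from one tree = base_score (0.5) + that tree's leaf value.
leaves = np.array([-2.90277791, 2.65277767], dtype=np.float32)
print(leaves + 0.5)  # [-2.4027779  3.1527777], matching np.unique(Ytest2)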
That also explains your second question: the leaf values are differences from that initial prediction, and those differences are not symmetric. If you set learning_rate=1 you could probably get the predictions to be symmetric after one round, or you could just set base_score=0.
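For example (a sketch reusing X, y, and Xtest from the question; since the toy data is symmetric about 0, the two leaves should come out as mirror images once the 0.5 offset is removed):
import numpy as np
import xgboost as xgb

# base_score=0 removes the 0.5 offset; learning_rate=1 applies the
# first tree's leaf values without shrinkage.
model0 = xgb.XGBRegressor(booster='gbtree', tree_method='exact',
                          n_estimators=1, max_depth=1,
                          base_score=0, learning_rate=1).fit(X, y)
print(model0.get_booster().get_dump()[0])
print(np.unique(model0.predict(Xtest)))  # predictions now equal the leaf values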
Upvotes: 2