Reputation: 1
I am trying to validate the XGBoost output (booster.predict) for logistic regression against my understanding of how the prediction is computed from the trees that were built. All my results are off by a constant of around -1.58. Sharing below the code I used to validate this; I am definitely missing something here, so I would appreciate help understanding what it is.
import xgboost as xgb
import pandas as pd
import numpy as np
import math
np.random.seed(1)
data = pd.DataFrame(np.arange(100*4).reshape((100,4)), columns=['a', 'b', 'c', 'd'])
label = pd.DataFrame(np.random.randint(2, size=(100,1)))
features = ['a', 'b', 'c', 'd']
dtrain = xgb.DMatrix(data, label=label)
param = {"max_depth":2, "base_score":0.2, 'objective': 'binary:logistic'}
clf1 = xgb.train(param, dtrain, 2)
clf1.dump_model("base_score1.txt")
e = math.exp(-(0.617647052 + 0.325955093 + 0.2))  # leaf values from the dump below, plus base_score
print(clf1.predict(dtrain)[0], 1/(1+e))
## 0.39109966 0.7583403831446165
## Ideally e should be 1.5568930331924702, while here e is 0.31866905973448423
Here is the model dump that was generated:
booster[0]:
0:[a<126] yes=1,no=2,missing=1
	1:[a<58] yes=3,no=4,missing=3
		3:leaf=0.617647052
		4:leaf=0.0483870991
	2:leaf=0.691919208
booster[1]:
0:leaf=0.325955093
So my understanding is that bst.predict() outputs the sigmoid applied over the sum of the tree values and base_score, i.e. 1/(1+math.exp(-sum)), where sum = base_score + sum_of_tree_values (over however many trees there are).
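To make that concrete, here is that understanding written out for the first row (a=0, so booster[0] routes it to leaf 3; sigmoid is my own helper, and the leaf values are copied from the dump above):

```python
import math

def sigmoid(x):
    # logistic function: maps a raw margin to a probability
    return 1.0 / (1.0 + math.exp(-x))

# Leaf values reached by the first row, taken from the model dump:
leaf_tree0 = 0.617647052   # booster[0]: a=0 < 58, so leaf 3
leaf_tree1 = 0.325955093   # booster[1] is a single leaf
base_score = 0.2

# My understanding: margin = base_score + sum of the leaf values
margin = base_score + leaf_tree0 + leaf_tree1
print(sigmoid(margin))  # ~0.7583, yet predict() returns 0.39109966
```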
What am I doing wrong?
This might be related, but I'm not sure exactly how: weight calculation of individual tree in XGBoost when using "binary:logistic"
Upvotes: 0
Views: 33