Reputation: 85
I am using the xgboost library to train a binary classifier. I would like to prevent data leakage from the trained model by adding noise to the weights (e.g. the values at the leaf nodes of the trees in the ensemble). For that I need to retrieve the weights of each tree and modify them.
I can see the weights by calling dump_model or trees_to_dataframe on the Booster object, which I define as
model = xgb.Booster(params, [dtrain])
The latter method returns a pandas DataFrame:
    Tree  Node    ID                          Feature  Split   Yes    No  Missing        Gain     Cover
 0     0     0   0-0                           tenure   17.0   0-1   0-2      0-1  671.161072  1595.500
 1     0     1   0-1      InternetService_Fiber optic    1.0   0-3   0-4      0-3  343.489227   621.125
 2     0     2   0-2      InternetService_Fiber optic    1.0   0-5   0-6      0-5  293.603149   974.375
 3     0     3   0-3                           tenure    4.0   0-7   0-8      0-7   95.604340   333.750
 4     0     4   0-4                     TotalCharges  120.0   0-9  0-10      0-9   27.897919   287.375
 5     0     5   0-5                Contract_Two year    1.0  0-11  0-12     0-11   32.057739   512.625
 6     0     6   0-6                           tenure   60.0  0-13  0-14     0-13  120.693176   461.750
 7     0     7   0-7  TechSupport_No internet service    1.0  0-15  0-16     0-15   37.326447   149.750
 8     0     8   0-8  TechSupport_No internet service    1.0  0-17  0-18     0-17   34.968536   184.000
 9     0     9   0-9                  TechSupport_Yes    1.0  0-19  0-20     0-19    0.766754    65.500
10     0    10  0-10                MultipleLines_Yes    1.0  0-21  0-22     0-21   19.335510   221.875
11     0    11  0-11                 PhoneService_Yes    1.0  0-23  0-24     0-23   19.035950   281.125
12     0    12  0-12                             Leaf    NaN   NaN   NaN      NaN   -0.191398   231.500
13     0    13  0-13   PaymentMethod_Electronic check    1.0  0-25  0-26     0-25   43.379410   320.875
14     0    14  0-14                Contract_Two year    1.0  0-27  0-28     0-27   13.401367   140.875
15     0    15  0-15                             Leaf    NaN   NaN   NaN      NaN    0.050262    94.500
16     0    16  0-16                             Leaf    NaN   NaN   NaN      NaN   -0.052444    55.250
17     0    17  0-17                             Leaf    NaN   NaN   NaN      NaN   -0.058929   111.000
18     0    18  0-18                             Leaf    NaN   NaN   NaN      NaN   -0.148649    73.000
19     0    19  0-19                             Leaf    NaN   NaN   NaN      NaN    0.161464    63.875
where the leaf values are stored in the Gain column (leaf nodes are the rows with the value Leaf in the Feature column). I could therefore add noise to those rows of the Gain column, but I do not know how to convert the pandas DataFrame back into a Booster object/XGBoost model. How should I go about this? Or is there a better way to retrieve and modify the leaf values of an XGBoost model?
Upvotes: 3
Views: 2028
Reputation: 879
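The text written by dump_model is meant for inspection only and cannot be loaded back into a Booster. If you instead save the model with save_model (in XGBoost 1.0 and later, a file name ending in .json gives an editable JSON file), you should be able to edit that file and then load it back into Python like this:
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model('model.json')  # load the edited model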
from: https://xgboost.readthedocs.io/en/latest/python/python_intro.html
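The editing step itself could look roughly like the sketch below. It is not taken from the XGBoost docs and rests on assumptions you should verify against your version: that the model was written by save_model to a JSON file, that it is a gbtree model whose trees sit under learner.gradient_booster.model.trees, that a leaf is a node whose left_children entry is -1, and that the leaf value is stored at the same index in split_conditions. The function names add_noise_to_leaves and perturb_model_file are made up for illustration, and Gaussian noise is just one possible choice.

```python
import json
import random


def add_noise_to_leaves(model, scale=0.01, seed=None):
    """Add Gaussian noise to every leaf value of a parsed XGBoost JSON model.

    ASSUMED layout (check it for your XGBoost version): trees are stored in
    model["learner"]["gradient_booster"]["model"]["trees"], a node is a leaf
    when its left child index is -1, and the leaf value sits in
    "split_conditions" at the node's index.
    Modifies `model` in place and returns the number of leaves changed.
    """
    rng = random.Random(seed)
    trees = model["learner"]["gradient_booster"]["model"]["trees"]
    changed = 0
    for tree in trees:
        for i, left in enumerate(tree["left_children"]):
            if left == -1:  # leaf node under the assumed layout
                tree["split_conditions"][i] += rng.gauss(0.0, scale)
                changed += 1
    return changed


def perturb_model_file(src_path, dst_path, scale=0.01, seed=None):
    """Read a JSON model file, perturb its leaves, write the result."""
    with open(src_path) as f:
        model = json.load(f)
    changed = add_noise_to_leaves(model, scale=scale, seed=seed)
    with open(dst_path, "w") as f:
        json.dump(model, f)
    return changed
```

After bst.save_model('model.json'), you would call perturb_model_file('model.json', 'model_noisy.json') and pass 'model_noisy.json' to load_model. It is worth comparing predictions before and after to confirm the edit actually took effect.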
That said, adding noise won't eliminate a data leakage problem and may or may not even help, so I'd double-check whether a better option is available for your use case, such as re-training another model on the proper input data.
Upvotes: 2