Reputation: 412
With xgboost.Booster.predict I can only get the prediction of the whole ensemble, or the predicted leaf index for each tree. How can I get the prediction value of each individual tree?
Upvotes: 14
Views: 5720
Reputation: 741
A much simpler solution is to dump the trees.
In Python, you can dump them as a list of strings (one string per tree).
Example:
m = xgb.XGBClassifier(max_depth=2, n_estimators=3).fit(X, y)
m.get_booster().get_dump()
This is what you'll get:
booster[0]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.0417] yes=3,no=4,missing=4
        3:leaf=-0.0965415
        4:leaf=-0.0679503
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0992546
        6:leaf=-0.0984374
booster[1]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<16.8917] yes=3,no=4,missing=4
        3:leaf=-0.0928132
        4:leaf=-0.0676056
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0945284
        6:leaf=-0.0937463
booster[2]:
0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.175] yes=3,no=4,missing=4
        3:leaf=-0.0878571
        4:leaf=-0.0610089
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
        5:leaf=-0.0904395
        6:leaf=-0.0896808
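To turn that dump into a per-tree value for a given row, you can parse the strings and walk each tree. The following is only a minimal sketch: the helper names (parse_tree, tree_value) are made up for illustration, it assumes the plain text format shown above (numeric `feature<threshold` splits, no extra statistics), and it compares with Python floats, so a value sitting exactly on a threshold may be routed differently than xgboost would route it.

```python
import re

# The three trees from the dump above, one string per tree.
DUMP = [
    """0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.0417] yes=3,no=4,missing=4
    3:leaf=-0.0965415
    4:leaf=-0.0679503
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
    5:leaf=-0.0992546
    6:leaf=-0.0984374""",
    """0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<16.8917] yes=3,no=4,missing=4
    3:leaf=-0.0928132
    4:leaf=-0.0676056
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
    5:leaf=-0.0945284
    6:leaf=-0.0937463""",
    """0:[sincelastrun<23.2917] yes=1,no=2,missing=2
    1:[sincelastrun<18.175] yes=3,no=4,missing=4
    3:leaf=-0.0878571
    4:leaf=-0.0610089
    2:[sincelastrun<695.025] yes=5,no=6,missing=6
    5:leaf=-0.0904395
    6:leaf=-0.0896808""",
]

SPLIT_RE = re.compile(
    r"(\d+):\[([^<\]]+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)")
LEAF_RE = re.compile(r"(\d+):leaf=([^,\s]+)")


def parse_tree(tree_text):
    """Parse one dumped tree into {node_id: node tuple}."""
    nodes = {}
    for line in tree_text.splitlines():
        line = line.strip()  # indentation only encodes depth, so drop it
        split = SPLIT_RE.match(line)
        if split:
            node_id, feature, threshold, yes, no, missing = split.groups()
            nodes[int(node_id)] = ("split", feature, float(threshold),
                                   int(yes), int(no), int(missing))
            continue
        leaf = LEAF_RE.match(line)
        if leaf:
            nodes[int(leaf.group(1))] = ("leaf", float(leaf.group(2)))
    return nodes


def tree_value(nodes, row):
    """Follow the splits from the root (node 0) and return the leaf value."""
    node = nodes[0]
    while node[0] == "split":
        _, feature, threshold, yes, no, missing = node
        value = row.get(feature)
        if value is None:
            next_id = missing  # absent feature follows the 'missing' branch
        elif value < threshold:
            next_id = yes
        else:
            next_id = no
        node = nodes[next_id]
    return node[1]


trees = [parse_tree(text) for text in DUMP]
per_tree = [tree_value(tree, {"sincelastrun": 20.0}) for tree in trees]
print(per_tree)
```

The values in per_tree are raw leaf scores (log-odds for a binary classifier), not probabilities.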
Upvotes: 0
Reputation: 3305
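Recent versions of xgboost have introduced a slicing API for the Booster, which makes Raul's answer, while valid, more complicated than necessary.
To get individual predictions, all you need to do is iterate over the booster object. Each iteration yields a booster for one boosting round, which is a single tree in the binary case: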
individual_preds = []
for tree_ in model.get_booster():
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
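Note, however, that these per-tree predictions are not additive contributions: each one is a probability, so summing them does not give the final prediction. To recover it, transform them back into log-odds, sum those, and apply the sigmoid once. This assumes the model's base score corresponds to zero log-odds (base_score=0.5), because every sliced booster also carries the base score: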
import numpy as np
from scipy.special import expit as sigmoid, logit as inverse_sigmoid

individual_preds = np.vstack(individual_preds)
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)
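To see why the round trip through log-odds is needed, here is a tiny worked example for a single sample using only the standard library. The three leaf scores are illustrative numbers (borrowed from the text dump in the other answer), and the sketch assumes the base score adds zero log-odds.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


# Illustrative raw leaf scores (log-odds) that three trees give one sample.
leaf_scores = [-0.0965415, -0.0928132, -0.0878571]

# What each single-tree booster would report: a probability.
# Assumption: the base score contributes zero log-odds (base_score=0.5).
tree_probs = [sigmoid(score) for score in leaf_scores]

# Adding the probabilities directly is not a valid ensemble prediction.
naive_sum = sum(tree_probs)

# Mapping back to log-odds, summing, and squashing once is.
final_prob = sigmoid(sum(logit(p) for p in tree_probs))

print(naive_sum, final_prob)
```

Here naive_sum is larger than 1, so it cannot be a probability, while final_prob equals the sigmoid of the summed leaf scores.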
A fully reproducible example, replicating Raul's efforts:
import numpy as np
import xgboost as xgb
from sklearn import datasets
from scipy.special import expit as sigmoid, logit as inverse_sigmoid

# Load data
iris = datasets.load_iris()
X, y = iris.data, (iris.target == 1).astype(int)

# Fit a model
# base_score=0.5 means zero initial log-odds, which the aggregation
# below assumes (newer xgboost versions may estimate it from the labels)
model = xgb.XGBClassifier(
    n_estimators=10,
    max_depth=10,
    use_label_encoder=False,
    objective='binary:logistic',
    base_score=0.5
)
model.fit(X, y)
booster_ = model.get_booster()

# Extract individual predictions (one probability vector per tree)
individual_preds = []
for tree_ in booster_:
    individual_preds.append(
        tree_.predict(xgb.DMatrix(X))
    )
individual_preds = np.vstack(individual_preds)

# Aggregate individual predictions into final predictions
individual_logits = inverse_sigmoid(individual_preds)
final_logits = individual_logits.sum(axis=0)
final_preds = sigmoid(final_logits)

# Verify correctness
xgb_preds = booster_.predict(xgb.DMatrix(X))
np.testing.assert_almost_equal(final_preds, xgb_preds)
Upvotes: 8
Reputation: 781
xgboost.core.Booster has two methods that allow you to do so.
First, xgboost.core.Booster.predict with the parameter pred_leaf set to True returns the predicted leaf indices. Then it is just a matter of looking up the scores of those leaves.
To get the leaf scores, we resort to the method xgboost.core.Booster.dump_model, which dumps the structure of the tree ensemble as plain text or JSON. The dump contains the leaf scores.
Below I show an example.
First, train an xgboost model on the Iris dataset.
import os
import json
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import datasets
# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
y = (y == 1).astype(int)
# Fit a model
n_estimators = 10
max_depth = 10
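# base_score=0.5 means zero initial log-odds, so the leaf scores alone
# add up to the logit (newer xgboost versions may estimate it from the labels)
model = xgb.XGBClassifier(
    n_estimators=n_estimators,
    max_depth=max_depth,
    min_child_weight=1,
    base_score=0.5)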
model.fit(X, y)
booster = model.get_booster()
Then get the predicted leaf indices (one row per instance, one column per tree).
pred_leaf_index = booster.predict(
    xgb.DMatrix(X),
    pred_leaf=True
).reshape(X.shape[0], n_estimators)
To get the leaf scores, we need to dump the model as a JSON file. The resulting dump contains the tree structure.
# Dump the model and load the dump
model_json_path = '/tmp/model.json'
booster.dump_model(model_json_path, dump_format='json')
with open(model_json_path, 'r') as f:
    model_dict = json.load(f)
Now comes perhaps the most complex part of this process. The following functions extract the leaf scores, first for each tree and then for the entire ensemble:
def get_tree_leaf_scores(tree):
    """Retrieve the leaf scores of a single tree.

    Parameters
    ----------
    tree : dict
        A dictionary representing a single xgboost decision tree
        (one item of the dump generated by `booster.dump_model`).

    Returns
    -------
    leafs : list of dict
        The leaf nodes of the tree, each with its 'nodeid' and 'leaf' score.
    """
    # A leaf node ends the recursion; always return a list so that
    # a tree consisting of a single leaf is handled too.
    if 'leaf' in tree:
        return [tree]
    # Otherwise collect the leaves of every child branch.
    leafs = []
    for child in tree['children']:
        leafs.extend(get_tree_leaf_scores(child))
    return leafs
def get_trees_leaf_as_dataframe(model_dict):
    """Retrieve the tree ensemble leaf scores.

    Parameters
    ----------
    model_dict : list of dict
        The result of loading the dump produced by
        `xgboost.core.Booster.dump_model` (one dict per tree).

    Returns
    -------
    trees_leaf_df : pandas.DataFrame
        Tree/node ids with their leaf score.
    """
    # Get tree nodes
    trees_leaf_df = []
    for tree_idx, tree in enumerate(model_dict):
        tree_leafs = get_tree_leaf_scores(tree)
        tree_leafs = pd.DataFrame(tree_leafs)
        tree_leafs['treeid'] = tree_idx
        trees_leaf_df.append(tree_leafs)
    trees_leaf_df = pd.concat(
        trees_leaf_df
    ).sort_values(['treeid', 'nodeid'])
    # Build a 'treeid-nodeid' key used to look up leaves later
    trees_leaf_df['id'] = trees_leaf_df.apply(
        lambda x: '%s-%s' % (int(x['treeid']), int(x['nodeid'])), axis=1)
    trees_leaf_df = trees_leaf_df[
        ['treeid', 'nodeid', 'id', 'leaf']
    ].set_index('id')
    return trees_leaf_df
Here is how you get the leaf scores as a DataFrame:
trees_leaf_df = get_trees_leaf_as_dataframe(model_dict)
trees_leaf_df.head()
Out[1]:
   nodeid      leaf  treeid   id
0       1 -0.555556       0  0-1
4       4 -0.528000       0  0-4
3       6 -0.120000       0  0-6
1       7  0.150000       0  0-7
2       8  0.550000       0  0-8
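If you do not need a DataFrame, the same information fits in a plain dictionary keyed by (tree id, node id). This is only a sketch of that alternative: iter_leaves is a made-up helper name, and model_dump is a small hand-written stand-in for the loaded JSON dump, where split nodes carry 'children' and leaf nodes carry 'leaf'.

```python
# Hand-written stand-in for the loaded JSON dump (hypothetical values).
model_dump = [
    {"nodeid": 0, "split": "f2", "split_condition": 2.45,
     "yes": 1, "no": 2, "missing": 1,
     "children": [
         {"nodeid": 1, "leaf": -0.555556},
         {"nodeid": 2, "leaf": 0.15},
     ]},
    {"nodeid": 0, "leaf": -0.434605},  # a tree that is a single leaf
]


def iter_leaves(node):
    """Yield (nodeid, leaf score) for every leaf below this node."""
    if "leaf" in node:
        yield node["nodeid"], node["leaf"]
    else:
        for child in node["children"]:
            yield from iter_leaves(child)


# Map (tree id, node id) -> leaf score for the whole ensemble.
leaf_scores = {
    (tree_id, node_id): score
    for tree_id, tree in enumerate(model_dump)
    for node_id, score in iter_leaves(tree)
}
print(leaf_scores)
```

Tuple keys avoid building and parsing 'treeid-nodeid' strings when looking leaves up later.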
At this point we are ready to get the model predicted leaf scores, with the help of the following function:
def get_pred_leaf_scores(pred_leaf_index, trees_leaf_df):
    """Look up the score of every predicted leaf.

    Returns
    -------
    pred_leaf_scores : pandas.DataFrame
        The predicted leaf scores, one row per instance and
        one column per tree.
    """
    # One tree per column of the predicted leaf index matrix
    tree_ids = range(pred_leaf_index.shape[1])
    pred_leaf_scores = []
    for single_instance_pred_leafs in pred_leaf_index:
        tree_node_id_predictions = [
            '%s-%s' % (treeid, nodeid)
            for treeid, nodeid in zip(tree_ids, single_instance_pred_leafs)]
        single_instance_pred_leaf_scores = trees_leaf_df.loc[
            tree_node_id_predictions]['leaf'].values
        pred_leaf_scores.append(single_instance_pred_leaf_scores)
    pred_leaf_scores = pd.DataFrame(pred_leaf_scores)
    return pred_leaf_scores
pred_leaf_scores = get_pred_leaf_scores(pred_leaf_index, trees_leaf_df)
pred_leaf_scores
Out[2]:
            0         1         2  ...         7         8         9
0   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
1   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
2   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
3   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
4   -0.555556 -0.434605 -0.373621  ... -0.248634 -0.231758 -0.215499
..        ...       ...       ...  ...       ...       ...       ...
145 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
146 -0.528000 -0.410725 -0.374272  ... -0.024406 -0.236201 -0.185685
147 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
148 -0.528000 -0.410725 -0.374272  ... -0.250879 -0.236201 -0.215589
149 -0.528000 -0.410725 -0.374272  ... -0.072375 -0.236201 -0.058543
[150 rows x 10 columns]
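The lookup itself is just an index into a table of leaf scores, so it can also be done without pandas. A minimal sketch with made-up inputs: leaf_scores maps (tree id, node id) to a score, and pred_leaf_index mimics the pred_leaf output with one row per sample and one leaf node id per tree. The node ids used for the second tree are hypothetical.

```python
# Hypothetical leaf scores keyed by (tree id, node id).
leaf_scores = {
    (0, 1): -0.555556,
    (0, 4): -0.528,
    (1, 1): -0.434605,
    (1, 3): -0.410725,
}

# Hypothetical pred_leaf output: rows are samples, columns are trees,
# and each entry is the node id of the leaf that the sample lands in.
pred_leaf_index = [
    [1, 1],
    [4, 3],
]

# The column position is the tree id, so enumerate() supplies it.
per_tree_scores = [
    [leaf_scores[(tree_id, node_id)] for tree_id, node_id in enumerate(row)]
    for row in pred_leaf_index
]
print(per_tree_scores)
```

Each inner list holds one sample's score from every tree, in tree order.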
If you want to make sure that the leaf scores yield the same probability predictions, do the following:
def from_leafs_scores_to_proba(pred_leaf_scores):
    """Turn per-tree leaf scores into positive-class probabilities."""
    # Sum the leaf scores of each instance to get its logit (log-odds).
    logit = pred_leaf_scores.sum(axis=1)
    # Apply the logistic function to the logit.
    pos_class_probability = 1 / (1 + np.exp(-logit))
    return pos_class_probability
y_scores_from_leafs = from_leafs_scores_to_proba(pred_leaf_scores)
y_scores_from_leafs.values[:10]
Out[9]:
array([0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579,
0.03715579, 0.03715579, 0.03715579, 0.03715579, 0.03715579])
y_scores = model.predict_proba(X)[:, 1]
y_scores[:10]
Out[10]:
array([0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578,
0.03715578, 0.03715578, 0.03715578, 0.03715578, 0.03715578],
dtype=float32)
Upvotes: 2