desertnaut

Reputation: 60319

Why do I get a different expected_value when I include the training data in TreeExplainer?

Including the training data in SHAP TreeExplainer gives a different expected_value for a scikit-learn GBT regressor.

Reproducible example (run in Google Colab):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap

shap.__version__
# 0.37.0

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])

# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635

So, the expected value when including the training data (gbt_data_explainer.expected_value) is quite different from the one calculated without supplying the data (gbt_explainer.expected_value).

Both approaches are additive and consistent when used with the (obviously different) respective shap_values:

np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True

but I wonder why they do not provide the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean value of predictions.

What am I missing here?

Upvotes: 5

Views: 2923

Answers (2)

Sergey Bushmanov

Reputation: 25189

Though @Ben did a great job digging out how the data gets passed through the Independent masker, his answer does not show exactly (1) how base values are calculated and where the different base value comes from, and (2) how to choose/lower the max_samples param.

Where the different value comes from

The masker object has a data attribute that holds the data after the masking process. To get the value shown in gbt_explainer.expected_value when a masker is supplied:

from shap.maskers import Independent

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)  # refit on the same split as above

# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963

one would need to do:

masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963

What about choosing max_samples?

Setting max_samples to the original dataset length seems to work with other explainers too:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit

corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)

vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)

model = LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)

explainer = shap.Explainer(
    model,
    masker=Independent(X_train, 100),
    feature_names=vectorizer.get_feature_names(),
)
explainer.expected_value
# -0.18417413671991964

This value comes from:

masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])

max_samples=100 seems to be a bit off from the true base_value (obtained by just feeding the feature means):

logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
# array([-0.02938039])

By increasing max_samples one can get reasonably close to the true baseline, while keeping the number of samples low:

masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238

So, to get the base value for an explainer of interest: (1) pass explainer.data (or masker.data) through your model, and (2) choose max_samples so that the base_value on the sampled data is close enough to the true base value. You may also check whether the values and ordering of the SHAP importances converge.
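For instance, a minimal convergence check along these lines (a sketch assuming gbt and X_train from the question; the sample sizes are arbitrary):

import numpy as np
from shap.maskers import Independent

# true baseline: mean model output over the full training set
true_base = gbt.predict(X_train).mean()

# masked baseline for increasing max_samples
for n in (100, 200, 400, 800):
    masker = Independent(X_train, max_samples=n)
    masked_base = gbt.predict(masker.data).mean()
    print(n, masked_base, abs(masked_base - true_base))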

Some people may notice that to get to the base values we sometimes average the feature inputs (LogisticRegression) and sometimes the model outputs (GBT).
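For the linear model the two averages coincide in log-odds space, since the logit of predict_proba is linear in the inputs; for trees they generally do not. A quick check of that, assuming model and masker from the snippets above:

import numpy as np
from scipy.special import logit

# mean of the per-row log-odds vs. log-odds at the mean input:
# equal (up to float error) because the model is linear in log-odds space
np.allclose(
    logit(model.predict_proba(masker.data)[:, 1]).mean(),
    logit(model.predict_proba(np.asarray(masker.data.mean(0)))[:, 1]),
)
# True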

Upvotes: 3

Ben Reiniger

Reputation: 12602

Apparently shap subsamples the data to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output over those 100 rows.

The data is passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
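A quick numerical check of this claim (a sketch assuming gbt, X_train, and gbt_data_explainer from the question; the masker's subsampling is deterministic here, so the values line up):

import numpy as np
from shap import maskers

# rebuild the masker that TreeExplainer constructs internally
# (max_samples defaults to 100)
masker = maskers.Independent(X_train, max_samples=100)

# expected_value is just the mean prediction over the 100 subsampled rows
np.isclose(gbt.predict(masker.data).mean(), gbt_data_explainer.expected_value)
# True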

Running

from shap import maskers

another_gbt_explainer = shap.TreeExplainer(
    gbt,
    data=maskers.Independent(X_train, max_samples=800),
    feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value

gets back to

-11.534353657511172
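This works because X_train here has 800 rows (1000 samples with test_size=0.2), so max_samples=800 keeps the whole training set and the masked mean prediction coincides with the full-data mean. A quick check, again assuming the objects from the question:

import numpy as np
from shap import maskers

masker = maskers.Independent(X_train, max_samples=800)
masker.data.shape
# (800, 10) -- nothing was dropped

np.isclose(gbt.predict(masker.data).mean(), mean_pred_gbt)
# True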

Upvotes: 6
