Amit S

Reputation: 273

What is the expected_value field of TreeExplainer for a Random Forest?

I used SHAP to explain my RF

from sklearn.ensemble import RandomForestRegressor
import shap

RF_best_parameters = RandomForestRegressor(random_state=24, n_estimators=100)
RF_best_parameters.fit(X_train, y_train.values.ravel())
shap_explainer_model = shap.TreeExplainer(RF_best_parameters)

The TreeExplainer class has an attribute expected_value. My first guess was that this field is the mean of the predicted y over X_train (I also read this here).

But it is not.
The output of the command:

shap_explainer_model.expected_value

is 0.2381.

The output of the command:

RF_best_parameters.predict(X_train).mean()

is 0.2389.

As we can see, the values are not the same. So what is the meaning of expected_value here?

Upvotes: 5

Views: 10559

Answers (2)

desertnaut

Reputation: 60319

This is due to a peculiarity of the method when used with the Random Forest algorithm; quoting from the response in the relevant GitHub thread, "shap explainer expected_value is different from model expected value":

It is because of how sklearn records the training samples in the tree models it builds. Random forests use a random subsample of the data to train each tree, and it is that random subsample that is used in sklearn to record the leaf sample weights in the model. Since TreeExplainer uses the recorded leaf sample weights to represent the training dataset, it will depend on the random sampling used during training. This will cause small variations like the ones you are seeing.

We can actually verify that this behavior is not present with other algorithms, say Gradient Boosting Trees:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np

import shap
shap.__version__
# 0.37.0

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)

mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172

gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])

np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])

But for RF, we do indeed get the "small variation" mentioned by the main SHAP developer in the thread above:

rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)

rf_explainer = shap.TreeExplainer(rf)
rf_explainer.expected_value
# array([-11.59166808])

mean_pred_rf = np.mean(rf.predict(X_train))
mean_pred_rf
# -11.280125877556388

np.isclose(mean_pred_rf, rf_explainer.expected_value)
# array([False])
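We can also confirm the mechanism quoted above by reconstructing the expected value directly from the leaf sample weights that sklearn records in each fitted tree. This is a minimal sketch, reusing the rf fitted above; tree_.weighted_n_node_samples and tree_.value are sklearn internals, and treating their weighted average as SHAP's base value is my reading of the quoted explanation, not a documented API:

per_tree_means = []
for est in rf.estimators_:
    tree = est.tree_
    is_leaf = tree.children_left == -1                # leaf nodes have no children
    weights = tree.weighted_n_node_samples[is_leaf]   # bootstrap sample weights recorded at fit time
    values = tree.value[is_leaf].ravel()              # mean target value stored in each leaf
    per_tree_means.append(np.average(values, weights=weights))

np.mean(per_tree_means)
# should match rf_explainer.expected_value (the bootstrap-weighted mean),
# not mean_pred_rf (the mean prediction over the full X_train)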

Upvotes: 5

hellowolrd

Reputation: 391

Just try:

shap_explainer_model = shap.TreeExplainer(RF_best_parameters, data=X_train, feature_perturbation="interventional", model_output="raw")

Then shap_explainer_model.expected_value should give you the mean prediction of your model on the training data.

Otherwise (when no background dataset is passed), TreeExplainer uses feature_perturbation="tree_path_dependent"; according to the documentation:

The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.
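A quick check (a sketch reusing RF_best_parameters and X_train from the question; with an explicit background dataset, the base value should be the mean model output over that dataset, up to floating point):

import numpy as np
import shap

interventional_explainer = shap.TreeExplainer(
    RF_best_parameters, data=X_train, feature_perturbation="interventional"
)

np.isclose(np.mean(RF_best_parameters.predict(X_train)),
           interventional_explainer.expected_value)
# should be True

This is exactly the quantity the question compared against, which is why the two numbers now agree.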

Upvotes: 1
