Yokkung

Reputation: 3

Can SHAP values be trusted to explain a model correctly?

I have a problem using SHAP values to interpret a tree-based model (https://github.com/slundberg/shap).

First, I have around 30 input features, two of which are highly positively correlated with each other.
After training an XGBoost model (Python), I looked at the SHAP values of these two features and found that they are negatively correlated.

Could you explain why the SHAP values of the two features do not show the same correlation as the inputs? And can I trust the SHAP output or not?

=========================
Correlation between input: 0.91788
Correlation between SHAP values: -0.661088

The 2 features are:

  1. Population in a province
  2. Number of families in a province

Model Performance
Train AUC: 0.73
Test AUC: 0.71

[Image: input scatter plot (x: Number of families in province, y: Population in province)]

[Image: SHAP values scatter plot (x: Number of families in province, y: Population in province)]

Upvotes: 0

Views: 1169

Answers (2)

Lingchao Mao

Reputation: 59

XGBoost is not a linear model, i.e. the relationship between the input features X and the predictions y is not linear. SHAP values build a linear (additive) explanation model of y. Therefore, it is very much expected that the correlation between two input features does not match the correlation between their SHAP values.
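To make this concrete, here is a minimal sketch using only the standard library. It assumes a hypothetical additive model f(x1, x2) = x1^2 + x2 (not the asker's XGBoost model) and uses the closed form that holds for additive models when SHAP values are computed under the feature-independence (interventional) formulation: phi_i = g_i(x_i) - mean(g_i(x_i)). All names and numbers are made up for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)

# Hypothetical inputs: x2 is a noisy copy of x1, so they are highly correlated.
x1 = [random.uniform(-1.0, 1.0) for _ in range(2000)]
x2 = [a + random.gauss(0.0, 0.3) for a in x1]

# Hypothetical additive, non-linear model: f(x1, x2) = x1**2 + x2.
g1 = [a ** 2 for a in x1]  # non-linear contribution of x1
g2 = list(x2)              # linear contribution of x2

# Assumed closed form for an additive model under feature independence:
# phi_i = g_i(x_i) - mean(g_i(x_i)).
mean_g1, mean_g2 = mean(g1), mean(g2)
phi1 = [v - mean_g1 for v in g1]
phi2 = [v - mean_g2 for v in g2]

corr_inputs = pearson(x1, x2)
corr_shap = pearson(phi1, phi2)
print(f"correlation between inputs:      {corr_inputs:.3f}")
print(f"correlation between SHAP values: {corr_shap:.3f}")
```

In this toy setup the inputs are strongly correlated while their SHAP values are close to uncorrelated, simply because x1 enters the model non-linearly.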

Upvotes: 0

Thom Lane

Reputation: 1063

You can have correlated variables that have opposite effects on the model output.

As an example, take the case of predicting the risk of mortality from two features: 'age' and 'trips to doctors'. Although these two variables are positively correlated, their effects differ. All other things held constant, a higher 'age' leads to a higher risk of mortality (according to the trained model), while a higher number of 'trips to doctors' leads to a lower risk of mortality.

XGBoost (and SHAP) isolates the effects of these two correlated variables by conditioning on the other variable, e.g. by splitting on the 'trips to doctors' feature after splitting on the 'age' feature. The assumption here is that they are not perfectly correlated.
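The 'age' and 'trips to doctors' example can be sketched numerically. The snippet below is only an illustration under stated assumptions: the data is synthetic, the weights `W_AGE` and `W_TRIPS` are invented, and a linear risk score stands in for the trained tree model so that SHAP values can be written in closed form as phi_i = w_i * (x_i - mean(x_i)), which is the linear-model case under feature independence.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)

# Synthetic data (assumption): older people make more trips to doctors.
age = [random.uniform(20.0, 90.0) for _ in range(2000)]
trips = [0.1 * a + random.gauss(0.0, 1.0) for a in age]

# Hypothetical linear risk score with opposite-signed effects.
W_AGE, W_TRIPS = 0.08, -0.30

# Assumed closed form for a linear model under feature independence:
# phi_i = w_i * (x_i - mean(x_i)).
mean_age, mean_trips = mean(age), mean(trips)
shap_age = [W_AGE * (a - mean_age) for a in age]
shap_trips = [W_TRIPS * (t - mean_trips) for t in trips]

corr_inputs = pearson(age, trips)
corr_shap = pearson(shap_age, shap_trips)
print(f"correlation between inputs:      {corr_inputs:.3f}")
print(f"correlation between SHAP values: {corr_shap:.3f}")
```

Because the two weights have opposite signs, the SHAP correlation here is the input correlation with its sign flipped. A tree model will not reproduce this exactly, but the same mechanism can yield a positive input correlation alongside a negative SHAP correlation, as in the question.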

Upvotes: 1
