Reputation: 3
I have a problem using SHAP values to interpret a tree-based model (https://github.com/slundberg/shap).
My input has around 30 features, and 2 of those features are highly positively correlated with each other.
After training the XGBoost model (Python) and looking at the SHAP values of these 2 features, I find that their SHAP values are negatively correlated.
Could you explain why the correlation between the SHAP values of the 2 features does not match the correlation between the inputs? And can I trust the SHAP output or not?
=========================
Correlation between input: 0.91788
Correlation between SHAP values: -0.661088
The 2 features are: Number of families in province, Population in province.
Model Performance
Train AUC: 0.73
Test AUC: 0.71
Input scatter plot (x: Number of families in province, y: Population in province):
SHAP values output scatter plot (x: Number of families in province, y: Population in province):
Upvotes: 0
Views: 1169
Reputation: 59
XGBoost is not a linear model, i.e. the relationship between the input features X and the prediction y is not linear. SHAP values build a local linear explanation model of y. It is therefore entirely expected that the correlation between two features' SHAP values does not match the correlation between the features themselves.
Upvotes: 0
Reputation: 1063
You can have correlated variables that have opposite effects on the model output.
As an example, take predicting risk of mortality from two features: 'age' and 'trips to doctor'. Although these two variables are positively correlated, their effects differ. All other things held constant, a higher 'age' leads to a higher risk of mortality (according to the trained model), while a higher number of 'trips to doctor' leads to a lower risk.
XGBoost (and SHAP) isolates the effect of each of these two correlated variables by conditioning on the other one: e.g. splitting on the 'trips to doctor' feature after splitting on the 'age' feature. The assumption here is that the two features are not perfectly correlated.
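This can be demonstrated without any ML library. The sketch below uses a hypothetical linear "risk" model (for a linear model, the exact SHAP value of feature i is simply w_i times the feature's deviation from its mean, assuming independent features): the two inputs are strongly positively correlated, yet because their weights have opposite signs, their SHAP values are strongly negatively correlated.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
age = [random.uniform(20, 80) for _ in range(1000)]
trips = [a + random.gauss(0, 5) for a in age]  # positively correlated with age

# Toy linear risk model: risk = 0.03*age - 0.02*trips (+ base rate).
# Exact SHAP value of feature i for a linear model: w_i * (x_i - mean_i).
ma, mt = statistics.fmean(age), statistics.fmean(trips)
shap_age = [0.03 * (a - ma) for a in age]
shap_trips = [-0.02 * (t - mt) for t in trips]

print(round(pearson(age, trips), 3))             # strongly positive
print(round(pearson(shap_age, shap_trips), 3))   # strongly negative
```

The inputs correlate at roughly +0.96 while their SHAP values correlate at roughly -0.96, which is exactly the pattern in the question: sign disagreement between input correlation and SHAP correlation is a property of the model's weights, not a bug in SHAP.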
Upvotes: 1