Reputation: 1973
I have been playing around with a toy dataset to understand more about the shap library and its usage. I found that the feature importances from the CatBoost regressor model are different from the feature importances shown in the summary_plot from the shap library.
I am comparing the feature importances from model.feature_importances_ (computed on the X_train set) against the summary plot from the shap explainer on the X_test set.
Here is my source code -
from catboost import CatBoostRegressor
import shap
shap.initjs()
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X,y = shap.datasets.boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train Model
model = CatBoostRegressor(iterations=300, learning_rate=0.1, random_seed=123)
model.fit(X_train, y_train, verbose=False, plot=False)
# Build a feature-importance dataframe, sorted in descending order
feat_imp_list = list(zip(model.feature_importances_, model.feature_names_))
feature_imp_df = pd.DataFrame(
    sorted(feat_imp_list, key=lambda x: x[0], reverse=True),
    columns=['feature_value', 'feature_name']
)
feature_imp_df
# Run shap explainer on X_test set and draw the summary plot
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
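For context, the summary plot orders features by the mean of the absolute SHAP values across the rows of X_test, which is a different statistic from the model's internal importance score, so the two rankings need not agree. A minimal numpy sketch of that ranking logic, using a made-up SHAP matrix and placeholder feature names rather than the real output above:

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (illustrative values only)
shap_values = np.array([
    [ 0.5, -0.1,  0.2],
    [-0.4,  0.3, -0.1],
    [ 0.6, -0.2,  0.0],
    [-0.5,  0.1,  0.3],
])
feature_names = ["CRIM", "DIS", "RM"]  # placeholder names

# The summary plot ranks features by mean |SHAP| per feature (column-wise)
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)  # ['CRIM', 'DIS', 'RM'] for these made-up values
```

Computing `np.abs(shap_values).mean(axis=0)` on your real shap_values and sorting lets you reproduce the summary plot's ordering numerically and compare it side by side with model.feature_importances_.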
Why does DIS show up at rank 3 in the feature-importance ranking from the model but at rank 7 in the summary plot from the SHAP library?
Upvotes: 5
Views: 10274
Reputation: 300
Feature importances are always positive, whereas SHAP values are signed contributions attached to the independent variables (they can be both negative and positive).
Both give you results in descending order:
- In the feature-importance plot you can see it starts from the maximum and goes down to the minimum. The values are normalized so their sum is always 100 (i.e. 100%).
- For SHAP values, each value is just the contribution attached to that particular feature, and the plot is also in descending order (from the highest magnitude to the lowest). Their sum can be any real number.
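The two scales above can be illustrated with plain numpy; the numbers here are made up purely to show the contrast, not taken from the model in the question:

```python
import numpy as np

# Hypothetical raw importance scores for 3 features
raw = np.array([12.0, 4.0, 24.0])
percent = 100 * raw / raw.sum()  # importances normalized to a 0-100 scale
print(percent.sum())             # 100.0 -- always sums to 100%

# SHAP values, by contrast, are signed per-sample contributions;
# for one row they sum to f(x) - E[f(x)], which has no fixed total
shap_row = np.array([1.2, -0.4, 0.7])
print(shap_row.sum())            # ~1.5, can be any real number
```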
P.S. For intuition, you can compare these SHAP values with the coefficients from a logistic regression model.
Cheers!
Upvotes: 4