Reputation: 4482
I have the following dataframe:
import pandas as pd
import random
import xgboost
import shap
foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})
I want to run a classification algorithm to predict the 3 classes.
So I split my dataset into a training and a testing set and ran an XGBoost classification:
from sklearn.model_selection import train_test_split

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)
model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
Now I would like to get the mean SHAP values for each class, instead of the mean of the absolute SHAP values generated by this code:
shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Also, the plot labels the classes as 0, 1, 2. How can I know which of the original classes 0, 1 and 2 correspond to?
Because this code:
shap.summary_plot(shap_values, X_test,
                  class_names=['a', 'b', 'c'])
gives
[summary plot image]
and this code:
shap.summary_plot(shap_values, X_test,
                  class_names=['b', 'c', 'a'])
gives
[summary plot image]
So I am not sure about the legend anymore. Any ideas?
Upvotes: 5
Views: 21325
Reputation: 11
First, encode the target with LabelEncoder; you can then recover the class order from its classes_ attribute:
import pandas as pd
import random
import xgboost
import shap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33,
                                                    random_state=42)

# Encode the string labels as integers 0, 1, 2
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train.values.ravel())
y_test_encoded = label_encoder.transform(y_test.values.ravel())

model = xgboost.XGBClassifier(objective="multi:softprob",
                              num_class=len(label_encoder.classes_))
model.fit(X_train, y_train_encoded)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Recover the original class names, in the encoded order, for the plot legend
classes = label_encoder.inverse_transform(range(len(label_encoder.classes_)))
shap.summary_plot(shap_values, X_test, class_names=classes)
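To see explicitly which integer in the plot's legend corresponds to which original class, you can print the mapping implied by the fitted encoder (a quick check, assuming the label_encoder from above):

# class i in the legend corresponds to label_encoder.classes_[i]
print(dict(enumerate(label_encoder.classes_)))  # e.g. {0: 'a', 1: 'b', 2: 'c'}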
Upvotes: 0
Reputation: 3213
The custom solution is an over-complication, IMHO.
Solution
shap.summary_plot(shap_values, X_test, class_inds="original", class_names=model.classes_)
Explanation

The class_names passed to summary_plot have to reflect the order of predictions. Since one a priori doesn't know that order, typically one can use model.classes_ for that purpose.
class_inds="original" tells shap to stick to the original order of predictions instead of sorting them (see the relevant code here).
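As a quick sanity check (a sketch, assuming a model fitted on a label-encoded target as in the other answers), you can confirm that the columns of predict_proba follow model.classes_, which is also the order the per-class SHAP values come in:

import numpy as np
proba = model.predict_proba(X_test)   # shape (n_samples, n_classes)
print(model.classes_)                 # column i of proba corresponds to classes_[i]
print(np.argmax(proba, axis=1))       # predicted class indices, in that same order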
P.S. I use shap 0.40.0.
P.P.S. I was not able to run your example, as my version of XGBoost doesn't allow strings as target categories. But it works with a label-encoded target or with other model types (sklearn.RandomForestClassifier or lgb.LGBMClassifier).
Upvotes: 0
Reputation: 4953
This is an updated version of @quant's code:
import pandas as pd
import random
import numpy as np
import xgboost
import shap
from sklearn.model_selection import train_test_split
import plotly_express as px

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})

# Factorize the string labels into integers (sort=True, so 0 -> 'a', 1 -> 'b', 2 -> 'c')
foo['class'], _ = pd.factorize(foo['class'], sort=True)

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)

    # Determine the correlation between each feature and its SHAP values,
    # so the mean |SHAP| can be signed by the direction of the effect
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)
    # Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    # Mean absolute SHAP value per feature, signed by the correlation
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()  # copy to avoid SettingWithCopyWarning
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)
    return k2_f

# Collect the signed mean SHAP values for each class
foo_all = pd.DataFrame()
for k, v in enumerate(model.classes_):
    foo = get_ABS_SHAP(shap_values[k], X_test)
    foo['class'] = v
    foo_all = pd.concat([foo_all, foo])

px.bar(foo_all, x='SHAP', y='Variable', color='class')
Upvotes: 1
Reputation: 4482
By doing some research and with the help of this post and @Alessandro Nesti's answer, here is my solution:
import pandas as pd
import random
import numpy as np
import xgboost
import shap
from sklearn.model_selection import train_test_split
import plotly_express as px

foo = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a','a','a','a','a','b','b','c','c','c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)
shap_values = shap.TreeExplainer(model).shap_values(X_test)

def get_ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)

    # Determine the correlation between each feature and its SHAP values,
    # so the mean |SHAP| can be signed by the direction of the effect
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)
    # Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    # Mean absolute SHAP value per feature, signed by the correlation
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)
    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()  # copy to avoid SettingWithCopyWarning
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)
    return k2_f

# Collect the signed mean SHAP values for each class
foo_all = pd.DataFrame()
for k, v in enumerate(model.classes_):
    foo = get_ABS_SHAP(shap_values[k], X_test)
    foo['class'] = v
    foo_all = pd.concat([foo_all, foo])

px.bar(foo_all, x='SHAP', y='Variable', color='class')
Upvotes: 2
Reputation: 31
For a multi-class model, the SHAP values are returned as a list with one array per class, so you can access each class's SHAP values via its index in that list.
For the summary plot of your class 0, the code would be
shap.summary_plot(shap_values[0], X_test)
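If you want one plot per class labelled with the original class name, a minimal sketch (assuming model.classes_ holds the labels in the same order as the list of SHAP arrays):

for i, cls in enumerate(model.classes_):
    print(f"Summary plot for class {cls}")
    shap.summary_plot(shap_values[i], X_test)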
Upvotes: 3
Reputation: 21
I had the same question; perhaps this issue can help: https://github.com/slundberg/shap/issues/764
I haven't tested it yet, but it seems the order should be the same as the order you would get when calling model.predict_proba(). In the link above it is suggested to use the class_names=model.classes_ option of the summary plot.
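Applied to the question's code, that suggestion would look like this (a sketch, untested, as noted above):

shap.summary_plot(shap_values, X_test, class_names=model.classes_)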
Upvotes: 2