prmlmu

Reputation: 663

Shap summary plots for XGBoost with categorical data inputs

XGBoost supports inputting features as categories directly, which is very useful when there are a lot of categorical variables. This doesn't seem to be compatible with SHAP, though:

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')

# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                             tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])

# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)

Throws an error: ValueError: DataFrame.dtypes for data must be int, float, bool or category.

Is it possible to use SHAP in this situation?

Upvotes: 5

Views: 5435

Answers (4)

MachineLeon

Reputation: 65

I found a solution to your problem using the native XGBoost API.

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({
    'target': [23, 42, 58, 29, 28],
    'feature_1': [38, 83, 38, 28, 57],
    'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')

# Create a DMatrix (needed for native API)
features = ['feature_1', 'feature_2']
d = xgboost.DMatrix(test_data[features], label=test_data['target'],
                    enable_categorical=True)

# Fit xgboost
params = {'nthread': 1, 'objective': 'reg:squarederror', 'tree_method': 'hist'}
model = xgboost.train(params, d)

# Get shap values
shap_and_base_values = model.predict(d, pred_contribs=True)

# Organize data
shap_values = shap_and_base_values[:, :-1]
base_values = shap_and_base_values[:, -1]
x_np = test_data[features].to_numpy()

# Create an Explanation
explanation: shap.Explanation = shap.Explanation(
        values=shap_values, base_values=base_values, feature_names=features,
        data=x_np)

# Further processing as you wish
shap.plots.bar(explanation)
shap.plots.beeswarm(explanation)

I first made your example compatible with the native XGBoost API. I then used pred_contribs=True in model.predict so that it returns the SHAP values (with the base value in the last column) instead of the predictions. Finally, I reorganized the data into a shap.Explanation, which can be processed further however you like.

Upvotes: 0

seapen

Reputation: 394

Is there a reason why you are not one-hot-encoding your categorical features to begin with? See XGBoost Categorical Variables: Dummification vs encoding

Especially given that you want to generate SHAP values later?

If you can one-hot-encode... here is a nice description of subsequent SHAP value interpretation for categorical features (https://towardsdatascience.com/shap-for-categorical-features-7c63e6a554ea), which involves inspecting both the summed SHAP value for the categories of the original feature and boxplots of SHAP values of individual categories.
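
For example, here is a minimal sketch of that approach on the toy data from the question (one-hot encoding with pd.get_dummies instead of enable_categorical; the dummy column names such as feature_2_A are just get_dummies' defaults, and the last lines show one way of summing the dummies' SHAP values back into a single value for the original feature):

import pandas as pd
import xgboost
import shap

# Toy data from the question
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})

# One-hot encode the categorical feature instead of using enable_categorical
X = pd.get_dummies(test_data.drop('target', axis=1), columns=['feature_2'], dtype=int)
y = test_data['target']

model = xgboost.XGBRegressor(tree_method='hist')
model.fit(X, y)

# With only numeric columns, TreeExplainer and the summary plot work as usual
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

# Sum the dummy columns' SHAP values back into one value for the original feature_2
dummy_cols = [c for c in X.columns if c.startswith('feature_2_')]
feature_2_shap = shap_values[:, [X.columns.get_loc(c) for c in dummy_cols]].sum(axis=1)
print(feature_2_shap)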

Upvotes: 0

user17788510

Reputation: 198

Unfortunately, generating SHAP values with XGBoost when categorical variables are used is an open issue. See, e.g., https://github.com/slundberg/shap/issues/2662

Given your specific example, I made it run by using a DMatrix as the input to SHAP (the DMatrix is the basic data type of XGBoost models, see the Learning API; the sklearn API that you are using doesn't need a DMatrix, at least for training):

import pandas as pd
import xgboost as xgb
import shap

# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())

# Fit xgboost
model = xgb.XGBRegressor(enable_categorical=True,
                         tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])

# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1),
                           label=test_data['target'],
                           enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
print(shap_values)

But the ability to generate SHAP values when there are categorical variables is very unstable: for example, if you add other parameters to the XGBoost model you get the error "Check failed: !HasCategoricalSplit()", which is the error referenced in my first link:

import pandas as pd
import xgboost as xgb
import shap

# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())

# Fit xgboost
model = xgb.XGBRegressor(colsample_bylevel=0.7,
                         enable_categorical=True,
                         tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])

# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1),
                           label=test_data['target'],
                           enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
shap_values

I've searched for a solution for months but, to conclude, as far as I understand it is not really possible yet to generate SHAP values with XGBoost and categorical variables (I hope someone can contradict me with a reproducible example). I suggest you try CatBoost.

########################## EDIT ############################

An example with CatBoost:

import pandas as pd
import catboost as cb
import shap

# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())

model = cb.CatBoostRegressor(iterations=100)
model.fit(test_data.drop('target', axis=1), test_data['target'],
          cat_features=['feature_2'], verbose=False)

# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data.drop('target', axis=1))
print('shap values: \n', shap_values)
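
If you also want the summary plot from the question title, here is a minimal sketch on top of the code above (converting the category column to its integer codes so the beeswarm can colour the points; using .cat.codes for that is just one option):

X_plot = test_data.drop('target', axis=1).copy()
X_plot['feature_2'] = X_plot['feature_2'].cat.codes  # numeric codes so the plot can colour the points
shap.summary_plot(shap_values, X_plot)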

Upvotes: 2

I used GradientBoostingRegressor and reshaped the array into 2 features per element:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap

df = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                   'feature_1': [38, 83, 38, 28, 57],
                   'feature_2': ['A', 'B', 'A', 'C', 'A']})

df["feature_1"] = df["feature_1"].astype(int)
df["target"] = df["target"].astype(int)

# Label-encode the categorical feature
encoder = preprocessing.LabelEncoder()
df["feature_2"] = encoder.fit_transform(df["feature_2"])

print(df)

SEED = 42
model = GradientBoostingRegressor(n_estimators=300, max_depth=8, random_state=SEED)

scale = StandardScaler()

columns = ["feature_1", "feature_2"]
n_features = len(columns)
X = np.array(scale.fit_transform(df[columns])).reshape(-1, n_features)
y = np.array(df["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df)

print(shap_values)

y_pred = model.predict(X_test)

x = np.arange(len(X_test))
plt.bar(x, y_test)
plt.bar(x, y_pred, color='green')
plt.show()

output:

target  feature_1  feature_2
0      23         38          0
1      42         83          1
2      58         38          0
3      29         28          2
4      28         57          0

Shap values

[[-4.65720266 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]]

or

import xgboost

df = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                   'feature_1': [38, 83, 38, 28, 57],
                   'feature_2': ['A', 'B', 'A', 'C', 'A']})

df["feature_1"] = df["feature_1"].astype(int)
df["target"] = df["target"].astype(int)

encoder = preprocessing.LabelEncoder()
df["feature_2"] = encoder.fit_transform(df["feature_2"])

SEED = 42
model = xgboost.XGBRegressor(enable_categorical=True, tree_method='hist')

scale = StandardScaler()

columns = ["feature_1", "feature_2"]
n_features = len(columns)
X = np.array(scale.fit_transform(df[columns])).reshape(-1, n_features)
y = np.array(df["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=SEED)
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(shap_values)

y_pred = model.predict(X_test)

x = np.arange(len(X_test))
plt.bar(x, y_test)
plt.bar(x, y_pred, color='green')
plt.show()

Upvotes: 1
