Reputation: 663
XGBoost supports passing categorical features directly, which is very useful when there are a lot of categorical variables. This doesn't seem to be compatible with SHAP:
import pandas as pd
import xgboost
import shap
# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                             tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])
# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)
Throws an error: ValueError: DataFrame.dtypes for data must be int, float, bool or category.
Is it possible to use Shap in this situation?
Upvotes: 5
Views: 5435
Reputation: 65
I found a solution to your problem using the native XGBoost API.
import pandas as pd
import xgboost
import shap
# Test data
test_data = pd.DataFrame({
    'target': [23, 42, 58, 29, 28],
    'feature_1': [38, 83, 38, 28, 57],
    'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
# Create a DMatrix (needed for the native API)
features = ['feature_1', 'feature_2']
d = xgboost.DMatrix(test_data[features], label=test_data['target'],
                    enable_categorical=True)
# Fit xgboost
params = {'nthread': 1, 'objective': 'reg:squarederror', 'tree_method': 'hist'}
model = xgboost.train(params, d)
# Get shap values
shap_and_base_values = model.predict(d, pred_contribs=True)
# Organize data
shap_values = shap_and_base_values[:, :-1]
base_values = shap_and_base_values[:, -1]
x_np = test_data[features].to_numpy()
# Create an Explanation
explanation: shap.Explanation = shap.Explanation(
    values=shap_values, base_values=base_values, feature_names=features,
    data=x_np)
# Further processing as you wish
shap.plots.bar(explanation)
shap.plots.beeswarm(explanation)
I first made your example compatible with the native XGBoost API, then used pred_contribs=True in model.predict so that it returns the SHAP values instead of the predictions. Finally, I organized the output into a shap.Explanation that you can process further to your liking.
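As a small follow-up, the same Explanation object can also be indexed per row and passed to the other shap plotting functions, e.g. a waterfall plot for a single observation (a minimal sketch reusing the explanation built above):
# Inspect a single prediction (row 0) with a waterfall plot,
# reusing the `explanation` object created above
shap.plots.waterfall(explanation[0])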
Upvotes: 0
Reputation: 394
Is there a reason why you are not one-hot encoding your categorical features to begin with? See XGBoost Categorical Variables: Dummification vs encoding.
Especially given that you want to generate SHAP values later?
If you can one-hot encode, here is a nice description of the subsequent SHAP value interpretation for categorical features (https://towardsdatascience.com/shap-for-categorical-features-7c63e6a554ea), which involves inspecting both the summed SHAP value across the categories of the original feature and boxplots of the SHAP values of the individual categories.
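For example, a minimal sketch of that route on the question's data, using pd.get_dummies for the one-hot encoding (the dummy column names such as feature_2_A are simply whatever get_dummies generates):
import pandas as pd
import xgboost
import shap
# Same test data as in the question
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
# One-hot encode instead of using the category dtype
X = pd.get_dummies(test_data.drop('target', axis=1), columns=['feature_2'])
y = test_data['target']
# Fit a plain (non-categorical) XGBoost model
model = xgboost.XGBRegressor(tree_method='hist')
model.fit(X, y)
# TreeExplainer works on the one-hot encoded frame
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Per the linked article: sum the SHAP values of the dummy columns
# to get a single contribution for the original feature_2
dummy_idx = [i for i, c in enumerate(X.columns) if c.startswith('feature_2_')]
feature_2_shap = shap_values[:, dummy_idx].sum(axis=1)
print(feature_2_shap)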
Upvotes: 0
Reputation: 198
Unfortunately, generating SHAP values with XGBoost when using categorical variables is an open issue. See, e.g., https://github.com/slundberg/shap/issues/2662
Given your specific example, I made it run by passing a DMatrix to shap (the DMatrix is the basic input data type of XGBoost models, see the Learning API; the scikit-learn API that you are using doesn't need a DMatrix, at least for training):
import pandas as pd
import xgboost as xgb
import shap
# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgb.XGBRegressor(enable_categorical=True,
                         tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])
# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
print(shap_values)
But the ability to generate SHAP values when there are categorical variables is very unstable: e.g., if you add other parameters to the XGBoost model you get the error "Check failed: !HasCategoricalSplit()", which is the error referenced in the issue linked above:
import pandas as pd
import xgboost as xgb
import shap
# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgb.XGBRegressor(colsample_bylevel=0.7,
                         enable_categorical=True,
                         tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'])
# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
shap_values
I've searched for a solution for months but, to conclude, as far as I understand, it is not really possible yet to generate SHAP values with XGBoost and categorical variables (I hope someone can contradict me, with a reproducible example). I suggest you try CatBoost instead.
########################## EDIT ############################
An example with CatBoost:
import pandas as pd
import catboost as cb
import shap
# Test data
test_data = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                          'feature_1': [38, 83, 38, 28, 57],
                          'feature_2': ['A', 'B', 'A', 'C', 'A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
model = cb.CatBoostRegressor(iterations=100)
model.fit(test_data.drop('target', axis=1), test_data['target'],
          cat_features=['feature_2'], verbose=False)
# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data.drop('target', axis=1))
print('shap values: \n', shap_values)
Upvotes: 2
Reputation: 4253
I used GradientBoostingRegressor with the categorical feature label-encoded, and reshaped the array into 2 features per row:
from sklearn.ensemble import GradientBoostingRegressor
import shap
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                   'feature_1': [38, 83, 38, 28, 57],
                   'feature_2': ['A', 'B', 'A', 'C', 'A']})
df["feature_1"] = df["feature_1"].astype(int)
df["target"] = df["target"].astype(int)
# Label-encode the categorical feature
encoder = preprocessing.LabelEncoder()
df["feature_2"] = encoder.fit_transform(df["feature_2"])
print(df)
SEED = 42
model = GradientBoostingRegressor(n_estimators=300, max_depth=8, random_state=SEED)
scale = StandardScaler()
columns = ["feature_1", "feature_2"]
n_features = len(columns)
X = np.array(scale.fit_transform(df[columns])).reshape(-1, n_features)
y = np.array(df["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
# NOTE: the full df (including the target column) is passed here,
# which is why the output below has three columns of SHAP values
shap_values = explainer.shap_values(df)
print(shap_values)
y_pred = model.predict(X_test)
x = np.arange(len(X_test))
plt.bar(x, y_test)
plt.bar(x, y_pred, color='green')
plt.show()
output:
   target  feature_1  feature_2
0      23         38          0
1      42         83          1
2      58         38          0
3      29         28          2
4      28         57          0
Shap values:
[[-4.65720266 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]]
Or, using XGBoost on the label-encoded data:
import xgboost  # the remaining imports are the same as in the previous snippet
df = pd.DataFrame({'target': [23, 42, 58, 29, 28],
                   'feature_1': [38, 83, 38, 28, 57],
                   'feature_2': ['A', 'B', 'A', 'C', 'A']})
df["feature_1"] = df["feature_1"].astype(int)
df["target"] = df["target"].astype(int)
encoder = preprocessing.LabelEncoder()
df["feature_2"] = encoder.fit_transform(df["feature_2"])
SEED = 42
model = xgboost.XGBRegressor(enable_categorical=True, tree_method='hist')
#model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=SEED)
scale = StandardScaler()
columns = ["feature_1", "feature_2"]
n_features = len(columns)
X = np.array(scale.fit_transform(df[columns])).reshape(-1, n_features)
y = np.array(df["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=42)
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(shap_values)
y_pred = model.predict(X_test)
x = np.arange(len(X_test))
plt.bar(x, y_test)
plt.bar(x, y_pred, color='green')
plt.show()
Upvotes: 1