Reputation: 897
I have been trying to work with the shap package. I want to determine the SHAP values from my logistic regression model. Unlike the TreeExplainer, the LinearExplainer requires a so-called masker. What exactly does this masker do, and what is the difference between the Independent and Partition maskers?
Also, I am interested in the important features of the test set. Do I then fit the masker on the training set or the test set? Below you can see a snippet of my code.
import shap
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

masker = shap.maskers.Independent(data=X_train)
# or
masker = shap.maskers.Independent(data=X_test)

explainer = shap.LinearExplainer(model, masker=masker)
shap_val = explainer(X_test)
Upvotes: 12
Views: 10097
Reputation: 25189
The Masker class provides the background data to "train" your explainer against. I.e., in:
explainer = shap.LinearExplainer(model, masker = masker)
you're using the background data determined by the masker (you can see which data is used by accessing the masker.data
attribute). You may read more about "true to the model" vs. "true to the data" explanations here or here.
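To make the distinction concrete, here is a rough sketch (assuming X_train is the DataFrame from the question): the Independent masker perturbs features independently of one another, which corresponds to the "true to the model" view, while the Partition masker uses a hierarchical clustering of the features so that correlated features are masked together, which leans towards "true to the data":

import shap

# Independent masker: masked features are replaced with values sampled
# independently from the background data ("true to the model").
ind_masker = shap.maskers.Independent(data=X_train, max_samples=100)

# Partition masker: a hierarchical clustering of the features is built, so
# correlated features are masked as groups ("true to the data").
part_masker = shap.maskers.Partition(data=X_train, max_samples=100, clustering="correlation")

# Either way, the background actually used is stored on the masker:
print(ind_masker.data.shape)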
Given the above, calculation-wise you may do either:
masker = shap.maskers.Independent(data = X_train)
or
masker = shap.maskers.Independent(data = X_test)
explainer = shap.LinearExplainer(model, masker = masker)
but conceptually, IMO, the following makes more sense:
masker = shap.maskers.Independent(data = X_train)
explainer = shap.LinearExplainer(model, masker = masker)
This is akin to the usual train/test paradigm, where you train your model (and explainer) on the training data and try to predict (and explain) the test data.
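For the second part of the question (important features of the test set), a common way to get a global ranking is the mean absolute SHAP value per feature over the test set. A minimal, self-contained sketch, using a toy sklearn dataset purely as a stand-in for your own data:

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; replace with your own X/y.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=5000, random_state=1).fit(X_train, y_train)

# Background ("training" data for the explainer) comes from the train split...
masker = shap.maskers.Independent(data=X_train)
explainer = shap.LinearExplainer(model, masker=masker)

# ...while the explanations are computed for the test split.
shap_val = explainer(X_test)

# Global importance on the test set: mean absolute SHAP value per feature.
importance = pd.Series(np.abs(shap_val.values).mean(axis=0), index=X_test.columns)
print(importance.sort_values(ascending=False).head(10))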
Unrelated to the question: an alternative to a masker, which samples background data for you, is to explicitly provide a background that allows comparing two datapoints: a reference point to compare against and the point of interest, as in this notebook. In such a manner one may find out why two seemingly similar datapoints were classified differently.
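For instance, one way to do this (a sketch reusing the model and X_test from the question) is to use a single reference row as the background, so that the SHAP values of another row read as contributions relative to that reference:

# Reference point to compare against, and the point we want to explain.
reference = X_test.iloc[[0]]
point_of_interest = X_test.iloc[[1]]

# With a one-row background, each SHAP value says how much that feature
# pushed the prediction for point_of_interest away from the reference.
explainer = shap.LinearExplainer(model, masker=shap.maskers.Independent(data=reference))
explanation = explainer(point_of_interest)
print(explanation.values)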
Upvotes: 21