aleinikov

Reputation: 73

SHAP or LIME with TPOT classifier

How would you go about using SHAP, LIME, or any other model interpretability tool with a TPOT exported pipeline? For example, here is some code for the shap library, but you cannot pass the TPOT pipeline into it. What would you pass in there instead?

explainer = shap.Explainer(model)
shap_values = explainer(X)

Upvotes: 0

Views: 674

Answers (1)

Alexandre Juma

Reputation: 3313

Solution 1:

To use SHAP to explain a scikit-learn Pipeline, which is what a TPOT optimization process produces, you need to point SHAP at the Pipeline's final estimator (the classifier/regressor step), and you need to transform your data with every preceding transformer step (i.e. pre-processors or feature selectors) before feeding it to the SHAP explainer.

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_iris
from tpot import TPOTClassifier

#Let's use the Iris dataset

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)  # a 1-D target avoids shape warnings during fit

tpot = TPOTClassifier(generations=3, population_size=25, verbosity=3, random_state=42)
tpot.fit(X, y)

#Inspect resulting Pipeline. Great, 2 steps in the Pipeline: one selector and then the classifier.

tpot.fitted_pipeline_

Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.05)),
                ('logisticregression',
                 LogisticRegression(C=10.0, random_state=42))])

# Before feeding your data to the explainer, you need to run it through every
# Pipeline step that comes before the classifier step.
# Beware that in this case it's just one step, but there could be more.

selector = tpot.fitted_pipeline_.named_steps["variancethreshold"]
shap_df = pd.DataFrame(
    selector.transform(X),
    columns=X.columns[selector.get_support(indices=True)],
)

# Finally, instruct the SHAP explainer to use the classifier step with the transformed data

shap.initjs()
explainer = shap.KernelExplainer(tpot.fitted_pipeline_.named_steps["logisticregression"].predict_proba, shap_df)
shap_values = explainer.shap_values(shap_df)

#Plot summary
shap.summary_plot(shap_values, shap_df)

Solution 1 Summary Plot
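If the exported pipeline has more than one pre-processing step, hard-coding each named step as above gets brittle. Assuming scikit-learn >= 0.21, where Pipelines support slicing, a generic sketch is to transform with everything except the last step and explain the last step alone. The pipeline below is a plain scikit-learn stand-in for `tpot.fitted_pipeline_`; any fitted Pipeline works the same way:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Stand-in for tpot.fitted_pipeline_ (mirrors the TPOT result above).
pipe = make_pipeline(
    VarianceThreshold(threshold=0.05),
    LogisticRegression(C=10.0, max_iter=1000),
)
pipe.fit(X, y)

# pipe[:-1] is a sub-Pipeline of every step except the last;
# pipe[-1] is the final estimator itself.
X_transformed = pipe[:-1].transform(X)
final_estimator = pipe[-1]

# These two are what you would hand to SHAP:
# explainer = shap.KernelExplainer(final_estimator.predict_proba, X_transformed)
print(X_transformed.shape, type(final_estimator).__name__)
# → (150, 4) LogisticRegression
```

This way the same two lines keep working no matter how many selector/transformer steps TPOT happens to put before the classifier.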

Solution 2:

Apparently, scikit-learn's Pipeline predict_proba() does exactly what was described in Solution 1 (i.e. transforms the data, then applies predict_proba with the final estimator).

In this sense, this should also work for you:

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import load_iris
from tpot import TPOTClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)  # a 1-D target avoids shape warnings during fit

tpot = TPOTClassifier(generations=10, population_size=50, verbosity=3, random_state=42, template='Selector-Transformer-Classifier')
tpot.fit(X, y)

#Resulting Pipeline
Pipeline(steps=[('variancethreshold', VarianceThreshold(threshold=0.0001)),
                ('rbfsampler', RBFSampler(gamma=0.8, random_state=42)),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=False, criterion='entropy',
                                        max_features=0.5, min_samples_leaf=10,
                                        min_samples_split=12,
                                        random_state=42))])

explainer = shap.KernelExplainer(tpot.fitted_pipeline_.predict_proba, X)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)

Solution 2 Summary Plot
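A quick way to convince yourself that Solutions 1 and 2 are equivalent: for a fitted Pipeline, predict_proba on the whole pipeline matches the final estimator's predict_proba applied to the pre-transformed data. A minimal check, again with a plain scikit-learn Pipeline standing in for `tpot.fitted_pipeline_`:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Same shape of pipeline as the TPOT result above: Selector-Transformer-Classifier.
pipe = make_pipeline(
    VarianceThreshold(threshold=0.0001),
    RBFSampler(gamma=0.8, random_state=42),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X, y)

# Whole-pipeline predict_proba vs. manual transform + final-step predict_proba
proba_pipeline = pipe.predict_proba(X)
proba_manual = pipe[-1].predict_proba(pipe[:-1].transform(X))
print(np.allclose(proba_pipeline, proba_manual))  # True
```

So passing `tpot.fitted_pipeline_.predict_proba` to KernelExplainer lets you explain in the original feature space, at the cost of SHAP attributing importance to pre-transformation features.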

Additional Remarks

You can use TreeExplainer, which is much faster than the generic KernelExplainer, if you use a tree-based model. As per the documentation, LightGBM, CatBoost, PySpark, and most tree-based scikit-learn models are supported.

Upvotes: 1
