Reputation: 4801
I have a dataset with missing data, encoded as NaN. That is fine for model fitting with XGBoost, but when I want to understand the model by analyzing feature effects with SHAP scatter plots, I am not sure what the correct usage is.
Consider the synthetic example below:
import numpy as np, scipy.special, xgboost as xgb, shap

rng = np.random.default_rng(0)

def gendata(n):
    X = rng.normal(size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(size=n)
    X[:n//2, 0] = np.nan  # first half of the feature is missing
    y = (rng.random(size=n) < scipy.special.expit(y)).astype(int)
    dmatrix = xgb.DMatrix(X, label=y, feature_names=['X0'])
    return X, y, dmatrix

X, y, dmat = gendata(10)
model = xgb.train({'objective': 'reg:squarederror', 'booster': 'gbtree'}, dmat)
explainer = shap.Explainer(model, feature_names=dmat.feature_names)

explanation = explainer(dmat)
shap.plots.scatter(explanation)

explanation = explainer(X, y)
shap.plots.scatter(explanation)
It produces the following two scatter plots.
When using raw numpy arrays, the plot shows the missing data as rug plot markers, which seems correct. When using the xgb.DMatrix, the missing values show up as zeros instead (zero imputation).
The explanation object holds the source data correctly (a sparse matrix in the dmat case and numpy arrays in the X,y case). I suppose there is a to_dense call somewhere in the scatter function that messes everything up.
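A quick check of the data attribute of the Explanation object (this is just how I inspected it) illustrates the difference between the two cases:

explanation = explainer(dmat)
print(type(explanation.data))  # scipy sparse matrix in the DMatrix case

explanation = explainer(X, y)
print(type(explanation.data))  # numpy array, NaNs preserved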
How should I do the scatter plot if I only have an xgb.DMatrix available?
Upvotes: 0
Views: 103
Reputation: 1
The issue occurs because SHAP's scatter function may improperly handle missing data when given an xgb.DMatrix: it appears to convert the sparse matrix to dense, which leads to zero imputation. To display missing values correctly (e.g., as rug plot markers), use the raw input data (a numpy array or pandas.DataFrame) instead of the xgb.DMatrix when computing SHAP values. The model can still be trained with a DMatrix; passing the original X to the SHAP explainer ensures proper handling of NaN values and accurate scatter plots.
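If you truly only have the DMatrix, one workaround is to reconstruct a dense array from it before explaining, turning the entries the DMatrix dropped as missing back into NaN. Below is a minimal sketch, assuming a reasonably recent XGBoost where DMatrix.get_data() returns the stored values as a scipy.sparse CSR matrix; the helper name dmatrix_to_dense is mine, and dmat and explainer are the objects from the question:

import numpy as np

def dmatrix_to_dense(dmat):
    # DMatrix does not store missing entries, so the matrix returned by
    # get_data() contains only the observed values.
    coo = dmat.get_data().tocoo()
    dense = np.full(coo.shape, np.nan)  # start from all-missing
    dense[coo.row, coo.col] = coo.data  # fill in the stored entries
    return dense

X_dense = dmatrix_to_dense(dmat)
explanation = explainer(X_dense)
shap.plots.scatter(explanation)  # NaNs now appear as rug plot markers

One caveat: the reconstructed array reflects what the model actually saw as missing, so if the DMatrix was originally built from a sparse input (where implicit zeros count as missing), true zeros and missing values cannot be told apart. Built from a dense array with NaN as missing, as in the question, the reconstruction is exact.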
Upvotes: -1