Reputation: 105
I am trying to use the Dalex package in Python to visualize certain characteristics of a binary logit model.
I copied a piece of code from the example book here (the whole fifth code cell), but now I am not quite sure how the results should be interpreted...
In my basic logit model, which I created using statsmodels, I manually selected one reference level for each categorical variable in order to avoid multicollinearity (this means that all the results from the model are interpreted with respect to the reference level).
But when I use the piece of code from the link above (also copied below this post), it first creates a Pipeline object in sklearn, one-hot encodes the categorical variables, and then fits the pipeline to the data and passes it to the Dalex Explainer as the model to be explained.
The problem is that when I use a function like model_profile() in Dalex, which should output a graph showing the ceteris paribus effect of a variable on the prediction, I do not know how to interpret the results, because it seems that every level of a categorical variable is included in the graph.
For example, the plot shows the effect of the categorical variable "gender" on the average prediction for both male and female...
The plot also shows a horizontal line labeled "mean prediction", but what is this "mean prediction"? Was it calculated with male as the reference level, or female?
I am really confused about what these results mean... Can anyone please clarify this? The model_profile() function I am trying to use is also explained in the notebook. Thank you!
The piece of code I copied:
import dalex as dx
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X, y come from the earlier cells of the notebook (the Titanic data)
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the median
        ('scaler', StandardScaler())                    # standardize to zero mean, unit variance
    ]
)
categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # one column per category
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
classifier = MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=500, random_state=0)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
clf.fit(X, y)
exp = dx.Explainer(clf, X, y)
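For reference, this is roughly the call I make afterwards (a minimal sketch following the dalex documentation; as I understand it, variable_type='categorical' is what makes dalex profile the categorical variables):
# partial-dependence profile for the 'gender' variable
pdp = exp.model_profile(type='partial', variable_type='categorical', variables=['gender'])
pdp.plot()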
Upvotes: 0
Views: 1885
Reputation: 4221
This is happening because, by default, sklearn's OneHotEncoder creates one indicator column for every category in your data. For linear models like logit, however, it is usually preferable to leave one of the categories out, both to avoid multicollinearity and to make the results interpretable with respect to a reference level. In this case, you need to change the default settings of your encoder.
You can achieve that by setting drop="first", which drops the first category during the one-hot encoding. The snippet below illustrates how this works: the "female" category is dropped from the one-hot encoding and only the "male" category gets encoded, which returns the result you are expecting. Note that this also works for non-binary features, as the second snippet further below shows.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
OHE = OneHotEncoder(drop="first")
OHE.fit_transform(X).toarray()
# [[1.],
#  [0.],
#  [0.],
#  [1.]]
OHE.get_feature_names()  # use get_feature_names_out() in newer scikit-learn versions
# ['x0_male']
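The same logic applies to features with more than two levels. A quick sketch with a made-up three-level "class" column: the alphabetically first category becomes the reference, and each remaining level gets its own column.
X3 = pd.DataFrame({"class": ["First", "Second", "Third", "Second"]})
OHE3 = OneHotEncoder(drop="first")
OHE3.fit_transform(X3).toarray()   # 'First' is dropped as the reference level
# [[0., 0.],
#  [1., 0.],
#  [0., 1.],
#  [1., 0.]]
OHE3.get_feature_names()
# ['x0_Second', 'x0_Third']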
So, all you need to change in your code is the following line in your pipeline definition:
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
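For completeness, the categorical transformer from your pipeline would then look like this (a sketch with only that one line changed; be aware that some older scikit-learn releases did not allow combining drop with handle_unknown='ignore', in which case you would have to fall back to the default handle_unknown='error'):
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # drop='first' removes one level per feature, so the remaining
        # dummy columns are interpreted relative to that reference level
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
    ]
)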
Upvotes: 1