Reputation: 105
I am trying to use the Dalex package in Python to visualize certain characteristics of a binary logit model.
I copied a piece of code from the example book here (the whole fifth code cell), but now I am not quite sure how the results should be interpreted...
In my basic logit model, which I created using statsmodels, I manually selected one reference level for each categorical variable in order to avoid multicollinearity (this means that all the results from the model are interpreted with respect to the reference level).
But when I use the piece of code from the link above (also copied below this post), it first creates a Pipeline object in sklearn, one-hot encodes the categorical variables, and then fits the pipeline to the data and passes it to the Dalex Explainer as the model to be explained.
The problem is that when I use a function like model_profile() in Dalex, which should output a graph showing the ceteris paribus effect of a variable on the prediction, I do not know how to interpret the results, because it seems that every level of a categorical variable is included in the graph.
For example, the plot shows the effect of the categorical variable "gender" on the average prediction for both male and female...
The plot also shows a horizontal line labeled "mean prediction", but what is this "mean prediction"? Was it calculated with male as the reference level, or female?
I am really confused about what these results mean... Can anyone please clarify this? The model_profile() function I am trying to use is also explained in the notebook. Thank you!
The piece of code I copied:
import dalex as dx
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X, y come from the earlier cells of the notebook (the Titanic data)
numerical_features = ['age', 'fare', 'sibsp', 'parch']
numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the median
        ('scaler', StandardScaler())                    # standardize to zero mean, unit variance
    ]
)
categorical_features = ['gender', 'class', 'embarked']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # one column per category
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
classifier = MLPClassifier(hidden_layer_sizes=(150, 100, 50), max_iter=500, random_state=0)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
clf.fit(X, y)
exp = dx.Explainer(clf, X, y)
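For reference, this is roughly the call I make afterwards (a minimal sketch following the dalex documentation; as I understand it, variable_type='categorical' is what makes dalex profile the categorical variables):
# partial-dependence profile for the 'gender' variable
pdp = exp.model_profile(type='partial', variable_type='categorical', variables=['gender'])
pdp.plot()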
Upvotes: 0
Views: 1885
Reputation: 4221
This is happening because, by default, sklearn's OneHotEncoder creates one indicator column for every category in your data. For linear models like logit, however, it is usually preferable to leave one of the categories out, both to avoid multicollinearity and to make the results interpretable with respect to a reference level. In this case, you need to change the default settings of your encoder.
You can achieve that by setting drop="first", which drops the first category during the one-hot encoding. The snippet below illustrates how this works: the "female" category is dropped from the one-hot encoding and only the "male" category gets encoded, which returns the result you are expecting. Note that this also works for non-binary features, as the second snippet further below shows.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
OHE = OneHotEncoder(drop="first")
OHE.fit_transform(X).toarray()
# [[1.],
#  [0.],
#  [0.],
#  [1.]]
OHE.get_feature_names()  # use get_feature_names_out() in newer scikit-learn versions
# ['x0_male']
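The same logic applies to features with more than two levels. A quick sketch with a made-up three-level "class" column: the alphabetically first category becomes the reference, and each remaining level gets its own column.
X3 = pd.DataFrame({"class": ["First", "Second", "Third", "Second"]})
OHE3 = OneHotEncoder(drop="first")
OHE3.fit_transform(X3).toarray()   # 'First' is dropped as the reference level
# [[0., 0.],
#  [1., 0.],
#  [0., 1.],
#  [1., 0.]]
OHE3.get_feature_names()
# ['x0_Second', 'x0_Third']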
So, all you need to change in your code is the following line in your pipeline definition:
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
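For completeness, the categorical transformer from your pipeline would then look like this (a sketch with only that one line changed; be aware that some older scikit-learn releases did not allow combining drop with handle_unknown='ignore', in which case you would have to fall back to the default handle_unknown='error'):
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # drop='first' removes one level per feature, so the remaining
        # dummy columns are interpreted relative to that reference level
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
    ]
)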
Upvotes: 1