stat_is_quo

Reputation: 143

Why are probabilities hand-calculated from sklearn.linear_model.LogisticRegression coefficients different from .predict_proba()?

I am running a multinomial logistic regression in sklearn, using sklearn.linear_model.LogisticRegression(multi_class="multinomial"). The dependent categorical variable has 3 categories: Agree, Disagree, Unsure. The independent variables are two categorical variables, Education and Gender (binary gender for simplicity in this example). I get different results when I hand-calculate the probabilities from the regression coefficients than when I use the built-in predict_proba().

import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# One-hot encode the two categorical predictors; label-encode the response
mnlr = LogisticRegression(multi_class="multinomial")
mnlr.fit(
    pd.get_dummies(df[["Education", "Gender"]]),
    preprocessing.LabelEncoder().fit_transform(df["statement"])
)

I concatenate the outputs of mnlr.intercept_ and mnlr.coef_ into a regression coefficients table that looks like this: [image: multinomial logistic regression coefficients]
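
For reference, a minimal sketch of one way to assemble such a table (coef_table is a name I'm introducing here for illustration, and the column order assumes the dummy columns come out of pd.get_dummies in the same order used for fitting):

import numpy as np

# Stack the intercepts alongside the per-class coefficients:
# one row per class, one column per dummy variable plus an intercept column
coef_table = pd.DataFrame(
    np.column_stack([mnlr.intercept_, mnlr.coef_]),
    index=mnlr.classes_,
    columns=["Intercept"] + list(pd.get_dummies(df[["Education", "Gender"]]).columns),
)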

Using mnlr.predict_proba(), I get results that I cast into a dataframe, to which I add the independent variables, like this: [image: predicted probabilities from sklearn function]
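
A rough sketch of that step, assuming X is the same dummy-encoded frame passed to fit() and pred_probs is just an illustrative name:

X = pd.get_dummies(df[["Education", "Gender"]])

# One column per class (in the order of mnlr.classes_), one row per respondent
pred_probs = pd.DataFrame(mnlr.predict_proba(X), columns=mnlr.classes_, index=df.index)

# Attach the independent variables for readability
pred_probs = pd.concat([df[["Education", "Gender"]], pred_probs], axis=1)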

These sum to 1 across the 3 potential categories for each data point.

However, I cannot seem to reproduce these results when I try to calculate the predicted probabilities by hand from the logistic regression coefficients.

First, for each Gender x Education combination, I calculate the logit (aka log-odds, if I understand correctly) by simply adding the intercept and the relevant variable terms. For example, to get the logit for a Woman with a Bachelor's degree in the Agree regression: 0.88076 + 0.21827 + 0.21687 = 1.31590. The table of logits looks like this: [image: logit table]
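
The same hand calculation in code, using the coefficient values quoted above (the specific numbers are just the ones from my coefficient table):

# Agree logit for a Woman with a Bachelor's degree:
# intercept + Bachelor's coefficient + Woman coefficient
logit_agree = 0.88076 + 0.21827 + 0.21687   # = 1.31590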

From this table, as I understand it, I should be able to convert these logits (log-odds) to predicted probabilities with p = e^logit / (1 + e^logit) for a given model and respondent (e.g., the probability that Women with a Bachelor's degree Agree with the statement). When I try this, however, I get quite different results from .predict_proba(), and the hand-calculated probabilities do not sum to 1, as shown in the table below: [image: hand-calculated predicted probabilities look wrong :(]

For example, Women with a Bachelor's degree here have a 0.78850 probability of agreeing with the statement, instead of the 0.7819 probability from predict_proba(). Additionally, the hand-calculated probabilities across the 3 categories do not sum to 1, but rather to 1.47146.
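
For concreteness, this is the conversion I was applying (the binary-logistic/sigmoid transformation), which reproduces the 0.78850 figure:

import math

# Sigmoid applied to the Agree logit for a Woman with a Bachelor's degree
p_agree = math.exp(1.31590) / (1 + math.exp(1.31590))   # ≈ 0.78850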

I am almost certain this is a basic error on my part, but I cannot for the life of me figure it out. What am I doing incorrectly?

Upvotes: 0

Views: 311

Answers (1)

stat_is_quo

Reputation: 143

I figured this one out eventually; the answer is probably obvious to folks who really know multinomial logistic regression. What I was missing was that I needed to apply the softmax function (also known, more descriptively, as the normalized exponential function) to the logits, rather than the binary logistic transformation. The softmax exponentiates the logit (log-odds) for each class and divides it by the sum of the exponentiated logits across all classes: p_k = e^(logit_k) / (e^(logit_1) + e^(logit_2) + e^(logit_3)). In this example, for Women with a Bachelor's degree, this means: [image: softmax example spelled out]

= [image: softmax example spelled out, continued]

= 0.737007424626824
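
In code, a minimal sketch of reproducing predict_proba() this way from the fitted model; it assumes the per-class logits can be taken from decision_function(), which for the multinomial case returns one linear score (intercept plus coefficients) per class:

import numpy as np

X = pd.get_dummies(df[["Education", "Gender"]])

# Per-class logits: one row per respondent, one column per class
logits = mnlr.decision_function(X)

# Softmax: exponentiate each logit and divide by the row-wise sum
hand_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# hand_probs should now match mnlr.predict_proba(X), and each row sums to 1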

Hopefully this will be helpful to anyone else trying to understand how to do this by hand! (For me, it is useful for applying model-based inference as an alternative to design-based inference in sample surveys.)

Sources that got me here: "How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification" and https://en.wikipedia.org/wiki/Softmax_function

Upvotes: 1
