MrOne2

Reputation: 41

Logistic regression with statsmodels vs scikit-learn: large difference in predictions

I used the Python libraries statsmodels and scikit-learn for a logistic regression and prediction. The class probability prediction results differ quite substantially. I am aware of the fact that the solution is calculated numerically, however, I would have expected the results to differ only slightly. My expectation would have been that both use the logistic function by default - is that correct or do I need to set any options?
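(By "the logistic function" I mean that both should model p(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))).)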

This is my scikit-learn code:

import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([1,2,3,4,5]).reshape((-1, 1))
y = np.array([0,0,1,1,1])
model = LogisticRegression()
model.fit(x, y)
model.predict_proba(np.array([2.5, 7]).reshape(-1,1))
Out:  array([[0.47910045, 0.52089955],
       [0.00820326, 0.99179674]])

I.e. the predictions for class 1 are 0.521 and 0.992.

If I use statsmodels instead, I get 0.730 and 0.942:

import statsmodels.api as sm
x = [1, 2, 3, 4, 5]
y = [0,0,1,1,1]
model = sm.Logit(y, x)
results = model.fit()
results.summary()
results.predict([2.5, 7])
Out: array([0.73000205, 0.94185834])

(As a side note: if I use R instead of Python, the predictions are 0.480 and 1.000, i.e. they are, again, quite different.)

I suspect these differences are not merely numerical but have an analytical, mathematical reason behind them, e.g. different functions being used under the hood. Can someone help?

Thanks!

Upvotes: 3

Views: 1417

Answers (1)

MrOne2

Reputation: 41

I have now found the solution. There were two reasons:

(1) scikit-learn applies L2 regularisation by default, which has to be turned off. This is done by changing the line that constructs the model in the scikit-learn code to:

model = LogisticRegression(penalty='none')

(On scikit-learn >= 1.2 use penalty=None instead; the string 'none' was removed in version 1.4.)

(2) The one that Yati Raj mentioned - thanks for the hint! Statsmodels does not add an intercept automatically. This can be fixed by adding the line

x = sm.add_constant(x)

in the statsmodels code before the model is fitted (the new data passed to predict then needs a constant column as well).
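For completeness, here is a minimal sketch combining both fixes (penalty=None assumes scikit-learn >= 1.2; older versions use penalty='none'). One caveat: once an intercept is included, this toy data is perfectly separable (x <= 2 is always class 0, x >= 3 always class 1), so the unpenalised maximum-likelihood estimate does not exist and the coefficients diverge; the fitted probabilities get pushed towards 0.5 at x = 2.5 and 1.0 at x = 7, which is presumably what the R result of 0.480 and 1.000 reflects.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1])
x_new = np.array([2.5, 7]).reshape(-1, 1)

# scikit-learn without the default L2 penalty
sk_model = LogisticRegression(penalty=None, max_iter=10000)
sk_model.fit(x, y)
print(sk_model.predict_proba(x_new)[:, 1])

# statsmodels with an explicit intercept column; on this separable
# toy data it may warn about (or, in older versions, raise on)
# perfect separation
sm_results = sm.Logit(y, sm.add_constant(x)).fit()
print(sm_results.predict(sm.add_constant(x_new, has_constant='add')))

With non-separable data, the two fits should then agree up to optimizer tolerance.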

Upvotes: 1
