MrOne2

Reputation: 41

Logistic regression with statsmodels vs scikit-learn: large difference in predictions

I used the Python libraries statsmodels and scikit-learn for a logistic regression and prediction. The class probability prediction results differ quite substantially. I am aware of the fact that the solution is calculated numerically, however, I would have expected the results to differ only slightly. My expectation would have been that both use the logistic function by default - is that correct or do I need to set any options?
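(By "the logistic function" I mean that both should model p(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))).)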

This is my scikit-learn code:

import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([1,2,3,4,5]).reshape((-1, 1))
y = np.array([0,0,1,1,1])
model = LogisticRegression()
model.fit(x, y)
model.predict_proba(np.array([2.5, 7]).reshape(-1,1))
Out:  array([[0.47910045, 0.52089955],
       [0.00820326, 0.99179674]])

I.e. the predictions for class 1 are 0.521 and 0.992.

If I use statsmodels instead, I get 0.730 and 0.942:

import statsmodels.api as sm
x = [1, 2, 3, 4, 5]
y = [0,0,1,1,1]
model = sm.Logit(y, x)
results = model.fit()
results.summary()
results.predict([2.5, 7])
Out: array([0.73000205, 0.94185834])

(As a side note: if I use R instead of Python, the predictions are 0.480 and 1.000, i.e. they are, again, quite different.)

I suspect these differences are not merely numerical but have an analytical, mathematical reason behind them, e.g. different functions being used under the hood. Can someone help?

Thanks!

Upvotes: 3

Views: 1417

Answers (1)

MrOne2

Reputation: 41

I have now found the solution. There were two reasons:

(1) scikit-learn applies L2 regularisation by default, which has to be turned off. This is done by changing the line that constructs the model in the scikit-learn code to:

model = LogisticRegression(penalty='none')

(On scikit-learn >= 1.2 use penalty=None instead; the string 'none' was removed in version 1.4.)

(2) The one that Yati Raj mentioned - thanks for the hint! Statsmodels does not add an intercept automatically. This can be fixed by adding the line

x = sm.add_constant(x)

in the statsmodels code before the model is fitted (the new data passed to predict then needs a constant column as well).
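For completeness, here is a minimal sketch combining both fixes (penalty=None assumes scikit-learn >= 1.2; older versions use penalty='none'). One caveat: once an intercept is included, this toy data is perfectly separable (x <= 2 is always class 0, x >= 3 always class 1), so the unpenalised maximum-likelihood estimate does not exist and the coefficients diverge; the fitted probabilities get pushed towards 0.5 at x = 2.5 and 1.0 at x = 7, which is presumably what the R result of 0.480 and 1.000 reflects.

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1])
x_new = np.array([2.5, 7]).reshape(-1, 1)

# scikit-learn without the default L2 penalty
sk_model = LogisticRegression(penalty=None, max_iter=10000)
sk_model.fit(x, y)
print(sk_model.predict_proba(x_new)[:, 1])

# statsmodels with an explicit intercept column; on this separable
# toy data it may warn about (or, in older versions, raise on)
# perfect separation
sm_results = sm.Logit(y, sm.add_constant(x)).fit()
print(sm_results.predict(sm.add_constant(x_new, has_constant='add')))

With non-separable data, the two fits should then agree up to optimizer tolerance.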

Upvotes: 1
