Reputation: 41
I used the Python libraries statsmodels and scikit-learn for a logistic regression and prediction. The predicted class probabilities differ quite substantially. I am aware that the solution is computed numerically; however, I would have expected the results to differ only slightly. My expectation would have been that both use the logistic function by default - is that correct, or do I need to set any options?
This is my scikit-learn code:
import numpy as np
from sklearn.linear_model import LogisticRegression
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 1])
model = LogisticRegression()
model.fit(x, y)
# predicted class probabilities for x = 2.5 and x = 7
model.predict_proba(np.array([2.5, 7]).reshape(-1, 1))
Out: array([[0.47910045, 0.52089955],
[0.00820326, 0.99179674]])
That is, the predicted probabilities for class 1 are 0.521 and 0.992.
If I use statsmodels instead, I get 0.730 and 0.942:
import statsmodels.api as sm
x = [1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 1]
model = sm.Logit(y, x)  # endog (y) first, then exog (x)
results = model.fit()
results.summary()
# predicted probabilities for x = 2.5 and x = 7
results.predict([2.5, 7])
Out: array([0.73000205, 0.94185834])
(As a side note: if I use R instead of Python, the predictions are 0.480 and 1.000, i.e. again quite different.)
I suspect these differences are not numerical, but that there is an analytical, mathematical reason behind them, e.g. different functions being used. Can someone help?
Thanks!
Upvotes: 3
Views: 1417
Reputation: 41
I have now found the solution. There were two reasons:
(1) scikit-learn applies L2 regularisation by default, which has to be turned off. This is done by changing the line model = LogisticRegression() in the scikit-learn code to:
model = LogisticRegression(penalty='none')
(On scikit-learn >= 1.2 the option is spelled penalty=None; the string 'none' has since been removed.)
(2) The one Yati Raj mentioned - thanks for the hint! statsmodels does not add an intercept automatically. This can be fixed by adding the line
x = sm.add_constant(x)
before the model is created in the statsmodels code. New observations passed to predict then also need the constant column, e.g. results.predict(sm.add_constant([2.5, 7])).
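For completeness, here is a minimal sketch with both fixes applied. One caveat: the original five data points become perfectly separable once an intercept is included (everything above x = 2.5 is class 1), so the unregularised maximum-likelihood estimate does not exist there and the fitted probabilities get pushed towards 0 and 1 - which presumably also explains the extreme R predictions of 0.480 and 1.000. The sketch therefore uses a slightly different, non-separable toy dataset of my own; the data and the expected agreement are assumptions for illustration, not from the original post.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Hypothetical non-separable toy data (the original x = 1..5 with
# y = 0,0,1,1,1 is separable once an intercept is added, so the
# unregularised fit would diverge there).
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1])
x_new = np.array([2.5, 7]).reshape(-1, 1)

# Fix (1): turn off scikit-learn's default L2 regularisation.
# (Use penalty='none' on scikit-learn versions older than 1.2.)
sk_model = LogisticRegression(penalty=None).fit(x, y)
sk_probs = sk_model.predict_proba(x_new)[:, 1]

# Fix (2): give statsmodels an explicit intercept column,
# both for fitting and for the new observations.
sm_results = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
sm_probs = sm_results.predict(sm.add_constant(x_new))

print(sk_probs)
print(sm_probs)  # both should now agree to several decimal places
With both fixes in place the two libraries maximise the same unpenalised likelihood, so any remaining difference comes down to solver tolerances rather than the model.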
Upvotes: 1