Reputation: 666
I'm pretty sure it's a feature, not a bug, but I would like to know if there is a way to make sklearn and statsmodels match in their logit estimates. A very simple example:
import numpy as np
import statsmodels.formula.api as sm
from sklearn.linear_model import LogisticRegression
np.random.seed(123)
n = 100
y = np.random.random_integers(0, 1, n)
x = np.random.random((n, 2))
# Constant term
x[:, 0] = 1.
The estimates with statsmodels:
sm_lgt = sm.Logit(y, x).fit()
Optimization terminated successfully.
Current function value: 0.675320
Iterations 4
print sm_lgt.params
[ 0.38442 -1.1429183]
And the estimates with sklearn:
sk_lgt = LogisticRegression(fit_intercept=False).fit(x, y)
print sk_lgt.coef_
[[ 0.16546794 -0.72637982]]
I think it has to do with the implementation in sklearn, which uses some sort of regularization. Is there an option to estimate a barebones logit as in statsmodels (it's substantially faster and scales much more nicely)? Also, does sklearn provide inference (standard errors) or marginal effects?
Upvotes: 4
Views: 5353
Reputation: 5698
As an additional note, I was struggling with differences in results when my design matrix was collinear. Obviously this means there should be some additional preprocessing to get a reliable result, but I was still hoping to find out why I got a result with sklearn while statsmodels errored out.
Short answer: setting method='bfgs' when calling fit in statsmodels gives almost identical results to the sklearn model, even in the case of collinear variables (once one has taken care of the fact that the default for statsmodels is no intercept, while sklearn fits the intercept by default).
Example (adapted from a similar question on OLS):
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
np.random.seed(237)
num_samples=1000
X=np.random.random((num_samples, 2))
X[:, 1] = 2*X[:, 0]
X_sm = sm.add_constant(X)
beta = [1, -2, .5]
error = np.random.random(num_samples)
y = np.round(1 / (1 + np.exp(-np.dot(X_sm, beta) + error)))  # logistic of X*beta plus noise, rounded to 0/1
lr = LogisticRegression(C=1e9).fit(X, y)
print "sklearn:"
print lr.intercept_
print lr.coef_
print "statsmodels:"
print sm.Logit(y, X_sm).fit(method='bfgs').params # method='nm' or default method errors out
(PS: If anyone has comments on the math behind these two solvers and the reliability of the results, I would love to hear it! I find it interesting that sklearn doesn't even throw a warning for this...)
Upvotes: 0
Reputation: 363497
Is there an option to estimate a barebones logit as in statsmodels?

You can set the C (inverse regularization strength) parameter to an arbitrarily high constant, as long as it's finite:
>>> sk_lgt = LogisticRegression(fit_intercept=False, C=1e9).fit(x, y)
>>> print(sk_lgt.coef_)
[[ 0.38440594 -1.14287175]]
Turning the regularization off is impossible because this is not supported by the underlying solver, Liblinear.
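Roughly speaking, Liblinear's L2-penalized objective is something like 0.5·w'w + C·Σ log(1 + exp(−yᵢ·w'xᵢ)) (with the labels coded as ±1), so as C grows the penalty term becomes negligible relative to the data term and the solution approaches the plain maximum-likelihood fit. A quick sketch, reusing x, y and sm_lgt from the question, to see the convergence:
for C in (1.0, 1e2, 1e4, 1e9):
    # larger C = weaker penalty; coefficients drift toward the unpenalized MLE
    print(C, LogisticRegression(fit_intercept=False, C=C).fit(x, y).coef_)
print(sm_lgt.params)  # statsmodels (unpenalized) estimates for comparison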
Also, does sklearn provide inference (standard errors) or marginal effects?
No. There's a proposal to add this, but it's not in the master codebase yet.
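If you only need the usual maximum-likelihood standard errors, you can compute them by hand from the inverse observed information, cov(β̂) ≈ (X'WX)⁻¹ with W = diag(pᵢ(1 − pᵢ)). A rough sketch (not an sklearn API), assuming the fit above with C=1e9 so the penalty is effectively off, and reusing x and sk_lgt:
p = 1.0 / (1.0 + np.exp(-x.dot(sk_lgt.coef_.ravel())))  # fitted probabilities
W = p * (1 - p)                                         # diagonal of the weight matrix
cov = np.linalg.inv((x * W[:, None]).T.dot(x))          # (X' W X)^{-1}
print(np.sqrt(np.diag(cov)))                            # should roughly match sm_lgt.bse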
Upvotes: 9