Reputation: 151
I am attempting to fit a logistic regression in Python that produces the same results as R. It comes close, but the results are not the same. I made up the following example to illustrate that a difference exists; the data are not real.
R
# RStudio 1.1.453
d <- data.frame(c(0, 0, 1, 1, 1),
                c(1, 0, 0, 0, 0),
                c(0, 1, 0, 0, 0))
colnames(d) <- c("v1", "v2", "v3")

model <- glm(v1 ~ v2,
             data = d,
             family = "binomial")
summary(model)
R Output
Call:
glm(formula = v1 ~ v2, family = "binomial", data = d)

Deviance Residuals:
       1         2         3         4         5
-1.66511  -0.00013   0.75853   0.75853   0.75853

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
v2           -19.665   6522.639  -0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 4.4987  on 3  degrees of freedom
AIC: 8.4987

Number of Fisher Scoring iterations: 17
Python
# Python 3.7.1
import pandas as pd                     # 0.23.4
import statsmodels.api as sm            # 0.9.0
import statsmodels.formula.api as smf   # 0.9.0

d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0],
                  "v3": [0, 1, 0, 0, 0]})

model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(link=sm.genmod.families.links.logit),
                data=d
                ).fit()
model.summary()
Python Output
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                     v1   No. Observations:                    5
Model:                            GLM   Df Residuals:                        3
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2.2493
Date:                Wed, 07 Nov 2018   Deviance:                       4.4987
Time:                        15:17:52   Pearson chi2:                     4.00
No. Iterations:                    19   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0986      1.155      0.951      0.341      -1.165       3.362
v2           -21.6647   1.77e+04     -0.001      0.999   -3.48e+04    3.47e+04
==============================================================================
There is a difference in the number of iterations (17 vs. 19), and the v2 estimate and its standard error also differ. From what I can tell, the two libraries may use different convergence criteria, but I don't understand the details. Is there some other setting I might be missing?
Upvotes: 4
Views: 2946
Reputation: 16174
At a guess, they have different tradeoffs with regard to numerical stability. The variance of the v2 estimate is enormous, which is probably causing them both to struggle… I'd say they've basically given the same answer, at least to the limits available with double-precision arithmetic.
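To make that concrete, here's a rough check of my own (a sketch, assuming numpy and scipy are available) that evaluates the binomial deviance at the coefficients each library reported; both land at about 4.4987:

# Evaluate the binomial deviance (-2 * log-likelihood) at the coefficients
# each library reported for the toy data above.
import numpy as np
from scipy.special import expit   # inverse logit

y  = np.array([0, 0, 1, 1, 1])
v2 = np.array([1, 0, 0, 0, 0])

def deviance(intercept, slope):
    p = expit(intercept + slope * v2)   # fitted probabilities
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(deviance(1.099, -19.665))      # R's stopping point,          ~4.4987
print(deviance(1.0986, -21.6647))    # statsmodels' stopping point, ~4.4987

The two values agree to about six decimal places (what difference remains is mostly due to the rounding of the printed coefficients), so the two fits are practically identical.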
The R implementation allows you to pass a control parameter:
> options(digits=12)
> model <- glm(v1 ~ v2, data=d, family="binomial", control=list(trace=T))
Deviance = 4.67724333758 Iterations - 1
Deviance = 4.5570420311 Iterations - 2
Deviance = 4.51971688994 Iterations - 3
Deviance = 4.50636401333 Iterations - 4
Deviance = 4.50150009179 Iterations - 5
Deviance = 4.49971718523 Iterations - 6
Deviance = 4.49906215541 Iterations - 7
Deviance = 4.49882130019 Iterations - 8
Deviance = 4.4987327103 Iterations - 9
Deviance = 4.49870012203 Iterations - 10
Deviance = 4.49868813377 Iterations - 11
Deviance = 4.49868372357 Iterations - 12
Deviance = 4.49868210116 Iterations - 13
Deviance = 4.4986815043 Iterations - 14
Deviance = 4.49868128473 Iterations - 15
Deviance = 4.49868120396 Iterations - 16
Deviance = 4.49868117424 Iterations - 17
which displays its convergence, but I couldn't find anything similar in the Python code.
Seeing the above output suggests that they could also be using different cutoffs to determine convergence; R's default is epsilon = 1e-8.
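If you want to poke at this from the Python side, GLM's fit() does take maxiter and tol arguments, so you can at least experiment with the stopping settings. A sketch (reusing the question's data, and assuming tol plays a role comparable to R's epsilon, which is worth verifying against the statsmodels source):

# Sketch only: maxiter and tol are real GLM.fit() arguments, but whether tol
# is compared against exactly the same quantity as R's epsilon is an assumption.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0],
                  "v3": [0, 1, 0, 0, 0]})

model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(),   # logit is the default link
                data=d
                ).fit(maxiter=25, tol=1e-8)      # cap iterations like R's default maxit
print(model.params)

Either way, the v2 coefficient is so poorly pinned down by this toy data that where each routine stops will always depend on settings like these.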
Upvotes: 2