ackshooairy

Reputation: 151

GLM R vs. Python

I am attempting to fit a logistic regression in Python that produces the same results as R. It seems close, but not the same. I made up the following example to illustrate that a difference exists; the data is not real.

R

# RStudio 1.1.453

d <- data.frame(c(0, 0, 1, 1, 1),
                c(1, 0, 0, 0, 0),
                c(0, 1, 0, 0, 0))

colnames(d) <- c("v1", "v2", "v3")

model <- glm(v1 ~ v2,
         data = d,
         family = "binomial")


summary(model)

R Output

Call:
glm(formula = v1 ~ v2, family = "binomial", data = d)

Deviance Residuals: 
       1         2         3         4         5  
-1.66511  -0.00013   0.75853   0.75853   0.75853  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
v2           -19.665   6522.639  -0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 4.4987  on 3  degrees of freedom
AIC: 8.4987

Number of Fisher Scoring iterations: 17

Python

# Python 3.7.1

import pandas as pd # 0.23.4
import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf # 0.9.0

d = pd.DataFrame({"v1" : [0, 0, 1, 1, 1],
                  "v2" : [1, 0, 0, 0, 0],
                  "v3" : [0, 1, 0, 0, 0]})

model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(link=sm.genmod.families.links.logit),
                data=d).fit()

model.summary()

Python Output

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     v1   No. Observations:                    5
Model:                            GLM   Df Residuals:                        3
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2.2493
Date:                Wed, 07 Nov 2018   Deviance:                       4.4987
Time:                        15:17:52   Pearson chi2:                     4.00
No. Iterations:                    19   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0986      1.155      0.951      0.341      -1.165       3.362
v2           -21.6647   1.77e+04     -0.001      0.999   -3.48e+04    3.47e+04
==============================================================================

There is a difference in the number of iterations, and the v2 coefficient and its standard error don't match. From what I can tell, the two implementations may use different convergence methods, but I don't understand the details. Is there some other setting I might be missing?

Upvotes: 4

Views: 2946

Answers (1)

Sam Mason

Reputation: 16174

at a guess, they have different tradeoffs with regard to numerical stability.

the variance of the v2 estimate is enormous, which is probably causing them both to struggle… I'd say they've basically given the same answer, at least to the limits available with double-precision arithmetic.
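
to see that concretely, you can plug both sets of estimates into the logistic function; the fitted probabilities come out essentially identical (a quick check using the coefficients from the two summaries above, not something either implementation prints for you):

import numpy as np

def fitted_prob(intercept, slope, v2):
    # fitted P(v1 = 1) under a logit link
    return 1 / (1 + np.exp(-(intercept + slope * v2)))

# R estimates
print(fitted_prob(1.099, -19.665, v2=0))    # ~0.750
print(fitted_prob(1.099, -19.665, v2=1))    # ~8.6e-09, effectively 0

# statsmodels estimates
print(fitted_prob(1.0986, -21.6647, v2=0))  # ~0.750
print(fitted_prob(1.0986, -21.6647, v2=1))  # ~1.2e-09, also effectively 0

the v2 = 1 cell is being pushed towards a fitted probability of exactly zero, which is why the slope estimate and its standard error blow up.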

the R implementation allows you to pass a control parameter:

> options(digits=12)
> model <- glm(v1 ~ v2, data=d, family="binomial", control=list(trace=T))
Deviance = 4.67724333758 Iterations - 1
Deviance = 4.5570420311 Iterations - 2
Deviance = 4.51971688994 Iterations - 3
Deviance = 4.50636401333 Iterations - 4
Deviance = 4.50150009179 Iterations - 5
Deviance = 4.49971718523 Iterations - 6
Deviance = 4.49906215541 Iterations - 7
Deviance = 4.49882130019 Iterations - 8
Deviance = 4.4987327103 Iterations - 9
Deviance = 4.49870012203 Iterations - 10
Deviance = 4.49868813377 Iterations - 11
Deviance = 4.49868372357 Iterations - 12
Deviance = 4.49868210116 Iterations - 13
Deviance = 4.4986815043 Iterations - 14
Deviance = 4.49868128473 Iterations - 15
Deviance = 4.49868120396 Iterations - 16
Deviance = 4.49868117424 Iterations - 17

which traces the deviance at each iteration, but I couldn't find anything similar in the Python code.

seeing the above output suggests that they could also be using different cutoffs to determine convergence; R's default (set via glm.control) is epsilon = 1e-8.
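
if you want to experiment from the Python side, GLM.fit() does take tol and maxiter arguments for the IRLS loop (as far as I can tell the defaults in statsmodels 0.9 are 1e-8 and 100, i.e. the same epsilon as R), e.g.:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0]})

# tol and maxiter control the IRLS stopping rule; the values below are
# (I believe) the statsmodels defaults, written out so they can be tweaked
model = smf.glm("v1 ~ v2", data=d,
                family=sm.families.Binomial()).fit(tol=1e-8, maxiter=100)
print(model.summary())

even with matching tolerances the stopping rules themselves may not be identical (R tests the relative change in deviance, if I remember correctly), so the iteration counts won't necessarily line up.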

Upvotes: 2
