ackshooairy

Reputation: 151

GLM R vs. Python

I am attempting to fit a logistic regression in Python that produces the same results as R. It seems close, but not the same. I made up the following example to illustrate that a difference exists; the data is not real.

R

# RStudio 1.1.453

d <- data.frame(c(0, 0, 1, 1, 1),
                c(1, 0, 0, 0, 0),
                c(0, 1, 0, 0, 0))

colnames(d) <- c("v1", "v2", "v3")

model <- glm(v1 ~ v2,
         data = d,
         family = "binomial")


summary(model)

R Output

Call:
glm(formula = v1 ~ v2, family = "binomial", data = d)

Deviance Residuals: 
       1         2         3         4         5  
-1.66511  -0.00013   0.75853   0.75853   0.75853  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
v2           -19.665   6522.639  -0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 4.4987  on 3  degrees of freedom
AIC: 8.4987

Number of Fisher Scoring iterations: 17

Python

# Python 3.7.1

import pandas as pd # 0.23.4
import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf # 0.9.0

d = pd.DataFrame({"v1" : [0, 0, 1, 1, 1],
                  "v2" : [1, 0, 0, 0, 0],
                  "v3" : [0, 1, 0, 0, 0]})

model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(link=sm.genmod.families.links.logit),
                data=d).fit()

model.summary()

Python Output

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     v1   No. Observations:                    5
Model:                            GLM   Df Residuals:                        3
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2.2493
Date:                Wed, 07 Nov 2018   Deviance:                       4.4987
Time:                        15:17:52   Pearson chi2:                     4.00
No. Iterations:                    19   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0986      1.155      0.951      0.341      -1.165       3.362
v2           -21.6647   1.77e+04     -0.001      0.999   -3.48e+04    3.47e+04
==============================================================================

There is a difference in the number of iterations, and the v2 coefficient and its standard error don't match. From what I can tell, the two implementations may use different convergence methods, but I don't understand the details. Is there some other setting I might be missing?

Upvotes: 4

Views: 2946

Answers (1)

Sam Mason

Reputation: 16174

at a guess, they have different tradeoffs with regard to numerical stability.

the variance of the v2 estimate is enormous, which is probably causing them both to struggle… I'd say they've basically given the same answer, at least to the limits available with double-precision arithmetic.
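
to see that concretely, you can plug both sets of estimates into the logistic function; the fitted probabilities come out essentially identical (a quick check using the coefficients from the two summaries above, not something either implementation prints for you):

import numpy as np

def fitted_prob(intercept, slope, v2):
    # fitted P(v1 = 1) under a logit link
    return 1 / (1 + np.exp(-(intercept + slope * v2)))

# R estimates
print(fitted_prob(1.099, -19.665, v2=0))    # ~0.750
print(fitted_prob(1.099, -19.665, v2=1))    # ~8.6e-09, effectively 0

# statsmodels estimates
print(fitted_prob(1.0986, -21.6647, v2=0))  # ~0.750
print(fitted_prob(1.0986, -21.6647, v2=1))  # ~1.2e-09, also effectively 0

the v2 = 1 cell is being pushed towards a fitted probability of exactly zero, which is why the slope estimate and its standard error blow up.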

the R implementation allows you to pass a control parameter:

> options(digits=12)
> model <- glm(v1 ~ v2, data=d, family="binomial", control=list(trace=T))
Deviance = 4.67724333758 Iterations - 1
Deviance = 4.5570420311 Iterations - 2
Deviance = 4.51971688994 Iterations - 3
Deviance = 4.50636401333 Iterations - 4
Deviance = 4.50150009179 Iterations - 5
Deviance = 4.49971718523 Iterations - 6
Deviance = 4.49906215541 Iterations - 7
Deviance = 4.49882130019 Iterations - 8
Deviance = 4.4987327103 Iterations - 9
Deviance = 4.49870012203 Iterations - 10
Deviance = 4.49868813377 Iterations - 11
Deviance = 4.49868372357 Iterations - 12
Deviance = 4.49868210116 Iterations - 13
Deviance = 4.4986815043 Iterations - 14
Deviance = 4.49868128473 Iterations - 15
Deviance = 4.49868120396 Iterations - 16
Deviance = 4.49868117424 Iterations - 17

which traces the deviance at each iteration, but I couldn't find anything similar in the Python code.

seeing the above output suggests that they could also be using different cutoffs to determine convergence; R's default (set via glm.control) is epsilon = 1e-8.
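
if you want to experiment from the Python side, GLM.fit() does take tol and maxiter arguments for the IRLS loop (as far as I can tell the defaults in statsmodels 0.9 are 1e-8 and 100, i.e. the same epsilon as R), e.g.:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0]})

# tol and maxiter control the IRLS stopping rule; the values below are
# (I believe) the statsmodels defaults, written out so they can be tweaked
model = smf.glm("v1 ~ v2", data=d,
                family=sm.families.Binomial()).fit(tol=1e-8, maxiter=100)
print(model.summary())

even with matching tolerances the stopping rules themselves may not be identical (R tests the relative change in deviance, if I remember correctly), so the iteration counts won't necessarily line up.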

Upvotes: 2
