Reputation: 113
I have a problem with the glm function in R.
Specifically, I am not sure how to include nominal variables.
The results that I get in R after running glm (the coefficient table from summary) are the following:
> df
x1 x2 y
1 a 2 0
2 b 4 1
3 a 4 0
4 b 2 1
5 a 4 1
6 b 2 0
> str(df)
'data.frame': 6 obs. of 3 variables:
$ x1: Factor w/ 2 levels "a","b": 1 2 1 2 1 2
$ x2: num 2 4 4 2 4 2
$ y: Factor w/ 2 levels "0","1": 1 2 1 2 2 1
Call:
glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -39.132 15208.471 -0.003 0.998
x1b 19.566 7604.236 0.003 0.998
x2 9.783 3802.118 0.003 0.998
However, when I run the LogitModelFit function in Wolfram Mathematica, I get different parameters.
The Wolfram code is provided below:
data = {{a, 2, 0}, {b, 4, 1}, {a, 4, 0}, {b, 2, 1}, {a, 4, 1}, {b, 2, 0}};
model = LogitModelFit[data, {x, y}, {x, y}, NominalVariables -> x]
model["BestFitParameters"]
And these are my estimated parameters:
{-18.5661, -18.5661, 9.28303}
model // Normal
1/(1 + E^(18.5661 - 9.28303 y + 18.5661 DiscreteIndicator[x, a, {a, b}]))
So, what is different here? Why do the results differ so much?
Am I doing something wrong in R or in Wolfram?
Upvotes: 6
Views: 222
Reputation: 8105
You have effectively 4 groups, for which you are trying to estimate 3 parameters:
library(dplyr)
# y is a factor here, so mean(y) would return NA; compare to "1" instead
df %>% group_by(x1, x2) %>% summarise(n = n(), y = mean(y == "1"))
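Computed by hand from the six rows in the question, that summary comes out to:

x1    x2     n     y
a      2     1   0.0
a      4     2   0.5
b      2     2   0.5
b      4     1   1.0

The two singleton groups (a, 2) and (b, 4) are fit perfectly (y = 0 and y = 1), which is what pushes the estimates off toward infinity; note the near-zero deviance residuals for observations 1 and 2 in the summary output further down.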
As you can see from the huge standard errors, the parameter estimates aren't stable. The standard errors reported by Wolfram should be very large as well (if they are shown).
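You can see the instability directly in base R by capping the number of Fisher scoring iterations: the coefficients keep growing while the fit barely improves (a quick check, assuming the df from the question is loaded):

for (it in c(5, 10, 25)) {
  m <- glm(y ~ x1 + x2, family = binomial(), data = df,
           control = list(maxit = it))
  # deviance barely moves while the estimates keep growing
  print(c(round(coef(m), 2), deviance = round(deviance(m), 4)))
}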
Second, Wolfram seems to use a different reference level for x1:
> df$x1 <- relevel(df$x1, "b")
> m <- glm(y ~ x1 + x2, family = binomial(), data = df, control = list(maxit = 100))
> summary(m)
Call:
glm(formula = y ~ x1 + x2, family = binomial(), data = df, control = list(maxit = 100))
Deviance Residuals:
1 2 3 4 5 6
-0.00008 0.00008 -1.17741 1.17741 1.17741 -1.17741
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.566 7604.236 -0.003 0.998
x1a -19.566 7604.236 -0.003 0.998
x2 9.783 3802.118 0.003 0.998
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178 on 5 degrees of freedom
Residual deviance: 5.5452 on 3 degrees of freedom
AIC: 11.545
Number of Fisher Scoring iterations: 18
This is much closer to Wolfram's result (it is actually the same model as you found; I just chose another reference level).
The predictions from both models (glm and Wolfram) will be practically equal. In fact, any parameter vector whose first two entries are very large negative numbers (in the limiting best fit they go to -Inf) and whose third entry is half their magnitude (9.783 * 2 = 19.566) yields almost the same fitted probabilities.
The factor of two originates from the fact that x2 only takes the values 2 and 4, which differ by two.
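A quick way to confirm this (a minimal sketch, assuming the df from the question is loaded) is to fit the model under both reference levels and compare the fitted probabilities:

m1 <- glm(y ~ x1 + x2, family = binomial(), data = df)  # reference level "a"
df$x1 <- relevel(df$x1, "b")
m2 <- glm(y ~ x1 + x2, family = binomial(), data = df,  # reference level "b"
          control = list(maxit = 100))
# Same model in two parameterisations, so the fitted probabilities
# agree up to the numerical noise of the diverging estimates
cbind(p1 = fitted(m1), p2 = fitted(m2))

Both columns come out as approximately 0, 1, 0.5, 0.5, 0.5, 0.5, matching the deviance residuals above.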
Upvotes: 4
Reputation: 2254
Seems like in your LogitModelFit result

1/(1 + E^(18.5661 - 9.28303 y + 18.5661 DiscreteIndicator[x, a, {a, b}]))

the DiscreteIndicator is a dummy variable for the condition x1 == 'a', whereas in your glm fit result there is instead a dummy variable x1b for the condition x1 == 'b':
Call:
glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -39.132 15208.471 -0.003 0.998
x1b 19.566 7604.236 0.003 0.998
x2 9.783 3802.118 0.003 0.998
So the difference seems to come down to which level of the nominal variable LogitModelFit and glm exclude as the reference. LogitModelFit excludes the level x == 'b' and keeps an indicator for x == 'a', whereas glm excludes the complementary level x1 == 'a' and keeps the dummy x1b for x1 == 'b'.
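You can check that these are the same model by folding glm's x1b coefficient into the intercept (a small sketch, assuming the df from the question is loaded):

b <- coef(glm(y ~ x1 + x2, family = binomial(), data = df))
# Rewrite using the indicator for 'a' instead: [x1 == 'b'] = 1 - [x1 == 'a']
c(intercept = unname(b["(Intercept)"] + b["x1b"]),  # -39.132 + 19.566 = -19.566
  x1a = unname(-b["x1b"]),                          # -19.566
  x2 = unname(b["x2"]))                             # 9.783

This reproduces the pattern of Wolfram's {-18.5661, -18.5661, 9.28303}; the absolute values differ only because the likelihood has no finite maximum here, so each implementation stops at a different point along the diverging path.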
Upvotes: 3