Reputation: 5475
I am trying to use the speedglm package for R to estimate regression models. In general the results are the same as using base R's glm function, but speedglm delivers unexpected behavior when I completely remove a given factor level from a data.frame. For example, see the code below:
dat1 <- data.frame(y=rnorm(100), x1=gl(5, 20))
dat2 <- subset(dat1, x1!=1)
glm("y ~ x1", dat2, family="gaussian")
Coefficients:
(Intercept) x13 x14 x15
-0.2497 0.6268 0.3900 0.2811
speedglm(as.formula("y ~ x1"), dat2)
Coefficients:
(Intercept) x12 x13 x14 x15
0.03145 -0.28114 0.34563 0.10887 NA
Here the two functions deliver different results because factor level x1==1 has been deleted from dat2. Had I used dat1 instead, the results would have been identical. Is there a way to make speedglm act like glm when processing data like dat2?
Upvotes: 3
Views: 538
Reputation: 3286
The default behavior for glm with a factor independent variable is to use the first non-empty level as the reference category. It appears that speedglm is treating the last level as the reference category. To get comparable results, you can use relevel in the call to glm:
set.seed(2)
dat1 <- data.frame(y=rnorm(100), x1=gl(5, 20))
dat2 <- subset(dat1, x1!=1)
glm(y ~ relevel(x1,"5"), dat2, family="gaussian")
Coefficients:
(Intercept) relevel(x1, "5")2 relevel(x1, "5")3 relevel(x1, "5")4
-0.27163 0.27135 0.36688 0.09934
speedglm(as.formula("y ~ x1"), dat2)
Coefficients:
(Intercept) x12 x13 x14 x15
-0.27163 0.27135 0.36688 0.09934 NA
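A possible explanation for the NA, sketched with base R only. It assumes that speedglm builds its design matrix without dropping unused factor levels, which is what the coefficient names above suggest; that is an assumption, not something checked against the package source. With the empty level 1 kept as the baseline, the four remaining dummy columns sum to the intercept column, so one coefficient cannot be estimated:

```r
set.seed(2)
dat1 <- data.frame(y = rnorm(100), x1 = gl(5, 20))
dat2 <- subset(dat1, x1 != 1)

# model.matrix() does not drop the unused level "1", so it stays the
# baseline and levels 2 to 5 each get their own dummy column
X <- model.matrix(y ~ x1, dat2)
colnames(X)

# Every remaining row belongs to exactly one of levels 2 to 5, so the
# four dummy columns add up to the intercept column
all(rowSums(X[, -1]) == X[, 1])

# Number of columns versus number of linearly independent columns
ncol(X)
qr(X)$rank
```

If the rank is one less than the number of columns, one coefficient has to be reported as NA, and the fit is then equivalent to one that uses level 5 as the reference.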
Upvotes: 2
Reputation: 5913
I think droplevels is the key. Compare str(droplevels(dat2)) with str(dat2): even though the rows with x1==1 have been dropped, level 1 is still listed in the factor levels. So speedglm(as.formula("y ~ x1"), droplevels(dat2)) should give the same result as glm("y ~ x1", dat2, family="gaussian").
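A minimal check of the leftover level, using base R only so it runs without speedglm installed. The names dat3, fit2 and fit3 are introduced just for this sketch:

```r
set.seed(2)
dat1 <- data.frame(y = rnorm(100), x1 = gl(5, 20))
dat2 <- subset(dat1, x1 != 1)

levels(dat2$x1)            # the empty level "1" is still listed
table(dat2$x1)             # its count is zero

dat3 <- droplevels(dat2)   # remove levels that no longer occur
levels(dat3$x1)

# glm() ignores unused levels on its own, so both fits should agree
fit2 <- glm(y ~ x1, data = dat2, family = gaussian)
fit3 <- glm(y ~ x1, data = dat3, family = gaussian)
all.equal(coef(fit2), coef(fit3))
```

Passing dat3 rather than dat2 to speedglm is then the comparison suggested above.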
Upvotes: 2