Reputation: 47
I have a long-format data frame on which I am running a Poisson regression:
'data.frame': 20000 obs. of 6 variables:
$ cal_y : int 2008 2008 2008 2008 2008 ...
$ age_y : int 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ cal_m : int 9 7 8 1 6 11 2 10 3 4 ...
$ n_outcome: int 276 187 164 352 229 250 332 267 348 291 ...
$ n_atrisk : int 4645 4645 4645 4645 4645 4645 4645 4645 4645 4645 ...
glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = df, family = poisson)
I would like to obtain the coefficients of the exposure cal_y on the outcome n_outcome separately for every value of age_y, and preferably aggregate the results into a single data frame.
I have tried several misguided versions of lapply() and tapply(). Currently, my best solution is to do this by hand:
glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = filter(df, age_y >= 0, age_y < 1), family = poisson)
But this is tedious (range(age_y) is 0 to 105), the results are not easily combined into a new data frame, and I'm not sure it is statistically correct to subset the data before performing the regression.
Any pointers, comments or help is appreciated.
Upvotes: 0
Views: 258
Reputation: 4907
What you describe in the comments above is what happens when you use dummy coding versus a means model. The two are equivalent and simply represent different constraints placed on the parameters (leave one level out vs. sum to zero), but it appears to be skewing your thinking here. If you code age_y as a categorical variable, you will get the appropriate output.
By running subset regressions you are not including all of the available information in each model, which is the statistically invalid approach; it also inflates the type I error rate. You should use all available information in a single model. Therefore, this is the correct specification:
# This is R's default dummy coding, which leaves one level out.
glm(n_outcome ~ factor(age_y) + factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = df, family = poisson(link = "log"))
# In contrast, this provides an estimate for each level of
# factor(age_y), where the test statistic asks whether the
# coefficient differs from zero.
glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = df, family = poisson(link = "log"))
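For concreteness, here is a minimal self-contained sketch showing that the two parameterisations fit the same model. The data are simulated; only the column names come from the question.

```r
# Simulated data for illustration only -- column names mirror the question.
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2009, cal_m = 1:3,
                  gender = factor(1:2))
df$n_atrisk  <- 4000 + rpois(nrow(df), 500)
df$n_outcome <- rpois(nrow(df), lambda = df$n_atrisk * 0.05)

# Dummy coding: the first age level is absorbed into the intercept.
m1 <- glm(n_outcome ~ factor(age_y) + factor(cal_y) + gender +
            offset(log(n_atrisk)), data = df, family = poisson)

# Means-model coding: one coefficient per age level, no intercept.
m2 <- glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + gender +
            offset(log(n_atrisk)), data = df, family = poisson)

# Same model, different constraints: the fitted values are identical.
all.equal(fitted(m1), fitted(m2))
```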
The above is correct unless you expect different effects of cal_y, gender, and/or n_atrisk for each age_y, in which case you will need estimates of the interactions of these variables with age_y (which can still be specified in a single model).
If you wish to test whether the levels of age_y differ from each other, you can test their contrasts. That is, using (... 0 + factor(age_y) ...) -- which sounds like your preference -- gives you information on each level of age_y relative to the null hypothesis of zero, but it does not provide a statistical test between, say, age_y level 1 and age_y level 2. For that, you need to test the contrast between them.
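A hand-rolled Wald test of one such contrast might look like the sketch below (packages such as multcomp or emmeans automate this). The data are simulated; only the column names come from the question.

```r
# Simulated data for illustration only.
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2009, cal_m = 1:3,
                  gender = factor(1:2))
df$n_atrisk  <- 4000 + rpois(nrow(df), 500)
df$n_outcome <- rpois(nrow(df), lambda = df$n_atrisk * 0.05)

fit <- glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + gender +
             offset(log(n_atrisk)), data = df, family = poisson)

# Contrast: age_y level 1 minus age_y level 2.
b <- coef(fit)
V <- vcov(fit)
k <- setNames(rep(0, length(b)), names(b))
k["factor(age_y)1"] <-  1
k["factor(age_y)2"] <- -1

est <- sum(k * b)                    # difference on the log scale
se  <- drop(sqrt(t(k) %*% V %*% k))  # its standard error
z   <- est / se
p   <- 2 * pnorm(-abs(z))            # two-sided Wald p-value
```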
Upvotes: 1
Reputation: 78620
You can do this with dplyr and my broom package:
library(dplyr)
library(broom)
results <- df %>%
  group_by(age_y) %>%
  do(tidy(glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
              data = ., family = poisson)))
This works because group_by and do perform the regression for each age_y value, and tidy then turns each regression into a data frame so the results can be recombined.
See the broom and dplyr vignette for more.
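If you'd rather avoid extra packages, the same split-apply-combine idea works in base R. A sketch with simulated data (only the column names come from the question):

```r
# Simulated data for illustration only.
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2009, cal_m = 1:3,
                  gender = factor(1:2))
df$n_atrisk  <- 4000 + rpois(nrow(df), 500)
df$n_outcome <- rpois(nrow(df), lambda = df$n_atrisk * 0.05)

# One regression per age_y value.
fits <- lapply(split(df, df$age_y), function(d)
  glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
      data = d, family = poisson))

# Stack the coefficient tables, keeping age_y as a key column.
results <- do.call(rbind, lapply(names(fits), function(a) {
  cf <- summary(fits[[a]])$coefficients
  data.frame(age_y = a, term = rownames(cf), cf, row.names = NULL)
}))
```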
Upvotes: 2