dais.johns

Reputation: 47

Perform poisson regression for each value in column

I have a long-form dataframe that I am performing a poisson regression on.

 'data.frame':  20000 obs. of  6 variables:
 $ cal_y  : int  2008 2008 2008 2008 2008 ...
 $ age_y  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ gender : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ cal_m  : int  9 7 8 1 6 11 2 10 3 4 ...
 $ n_outcome: int  276 187 164 352 229 250 332 267 348 291 ...
 $ n_atrisk : int  4645 4645 4645 4645 4645 4645 4645 4645 4645 4645 ...

glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = df, family = poisson)

I would like to know the coefficients of the exposure cal_y for the outcome n_outcome, estimated separately for every value of age_y, and preferably to be able to aggregate this information into one df.

I have tried several misguided versions of lapply() and tapply(). Currently, my best solution is to do this by hand:

glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
    data = filter(df, age_y >= 0, age_y < 1), family = poisson)

But this is tedious (range(age_y) is 0 to 105), the results are not easily combined into a new df, and I'm not sure it is statistically correct to subset the data prior to performing the regression.

Any pointers, comments or help are appreciated.

Upvotes: 0

Views: 258

Answers (2)

alexwhitworth

Reputation: 4907

What you describe in the comments above is what happens when you use effects coding versus a means model. The two are equivalent and simply represent different constraints placed on the parameters (leave one level out vs. sum to zero), but it appears to be skewing your thinking here. If you code age_y as a categorical variable, you will get the appropriate output.
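A tiny illustration of that equivalence, using toy data and ordinary lm() for simplicity (hypothetical values, not the question's data): the two codings place different constraints on the parameters but produce identical fitted values.

```r
# Toy data: three groups of two observations each
x <- factor(c("a", "a", "b", "b", "c", "c"))
y <- c(1, 2, 3, 4, 5, 6)

f1 <- lm(y ~ x)      # default coding: intercept + leave-one-out contrasts
f2 <- lm(y ~ 0 + x)  # means model: one coefficient per level

all.equal(fitted(f1), fitted(f2))  # TRUE -- same model, different parameterization
```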

By using subset regression, you are not including all available information in each model, which is the statistically invalid part; it also inflates the type I error rate. You should use all available information in one model. Therefore, this is the correct specification:

# This is the default way that R handles things, leaving one level out.
glm(n_outcome ~ factor(age_y) + factor(cal_y) + factor(cal_m) + gender +
      offset(log(n_atrisk)),
    data = df, family = poisson(link = "log"))

# In contrast, this will provide an estimate for each level of
# factor(age_y), where the test statistic asks whether the coefficient
# is statistically different from zero.
glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + factor(cal_m) + gender +
      offset(log(n_atrisk)),
    data = df, family = poisson(link = "log"))

The above is correct unless you expect different effects of cal_y, gender, and/or n_atrisk for each age_y, in which case you will need estimates of the interactions of these variables with age_y (which can still be specified in a single model).
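For instance, an age-by-calendar-year interaction can be added like this (a sketch on simulated data with hypothetical values, since the full df isn't shown):

```r
# Simulated data in the same shape as the question's data frame
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2010, cal_m = 1:12,
                  gender = factor(1:2))
df$n_atrisk  <- 4645
df$n_outcome <- rpois(nrow(df), lambda = 250)

# Single model: the interaction lets the cal_y effect differ by age_y
fit <- glm(n_outcome ~ factor(age_y) * factor(cal_y) + factor(cal_m) +
             gender + offset(log(n_atrisk)),
           data = df, family = poisson(link = "log"))

# The age-specific cal_y effects are the interaction coefficients
coef(fit)[grepl("age_y.*cal_y", names(coef(fit)))]
```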

If you wish to test if the levels of age_y are different from each other, you can test their contrasts.

That is, using (... 0 + factor(age_y) ...) -- which sounds like your preference -- gives you a test of each level of age_y against the null hypothesis of zero, but it does not provide a statistical test between, say, age_y:level1 and age_y:level2. To do that test, you need to test their contrasts.
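A contrast can be tested by hand with a Wald test built from coef() and vcov() (a base-R sketch on simulated data with hypothetical values; packages like multcomp automate this):

```r
# Simulated data in the same shape as the question's data frame
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2010, cal_m = 1:12,
                  gender = factor(1:2))
df$n_atrisk  <- 4645
df$n_outcome <- rpois(nrow(df), lambda = 250)

fit <- glm(n_outcome ~ 0 + factor(age_y) + factor(cal_y) + factor(cal_m) +
             gender + offset(log(n_atrisk)),
           data = df, family = poisson(link = "log"))

# Wald test of the contrast: age_y level 0 vs. age_y level 1
b <- coef(fit)
V <- vcov(fit)
est <- b["factor(age_y)0"] - b["factor(age_y)1"]
se  <- sqrt(V["factor(age_y)0", "factor(age_y)0"] +
            V["factor(age_y)1", "factor(age_y)1"] -
            2 * V["factor(age_y)0", "factor(age_y)1"])
z <- est / se
p <- 2 * pnorm(-abs(z))  # two-sided p-value for level0 == level1
```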

Upvotes: 1

David Robinson

Reputation: 78620

You can do this with dplyr and my broom package:

library(dplyr)
library(broom)

results <- df %>%
  group_by(age_y) %>%
  do(tidy(glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender + offset(log(n_atrisk)),
              data = ., family = poisson)))

This works because group_by and do perform the regression for each age_y value, and tidy then turns each regression into a data frame so the results can be recombined.

See the broom and dplyr vignette for more.
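If you'd rather avoid the extra packages, the same split-fit-recombine idea can be done in base R with split() and lapply() (a sketch on simulated data with hypothetical values, since the full df isn't shown):

```r
# Simulated data in the same shape as the question's data frame
set.seed(1)
df <- expand.grid(age_y = 0:2, cal_y = 2008:2010, cal_m = 1:12,
                  gender = factor(1:2))
df$n_atrisk  <- 4645
df$n_outcome <- rpois(nrow(df), lambda = 250)

# One regression per age_y level; keep only the cal_y coefficients
per_age <- lapply(split(df, df$age_y), function(d) {
  fit <- glm(n_outcome ~ factor(cal_y) + factor(cal_m) + gender +
               offset(log(n_atrisk)),
             data = d, family = poisson)
  cf <- summary(fit)$coefficients
  cf <- cf[grep("cal_y", rownames(cf)), , drop = FALSE]
  data.frame(age_y = d$age_y[1], term = rownames(cf), cf, row.names = NULL)
})
results <- do.call(rbind, per_age)  # one row per age_y x cal_y coefficient
```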

Upvotes: 2
