Reputation: 850
Please consider the following:
When fitting a GEE with geepack
we receive a model that we can predict
with new values but base R does not support GEE models to calculate the confidence intervals. To obtain confidence intervals we can use emmeans::emmeans()
.
If the variables in the model are categorical and continuous I run into problems.
When estimating the marginal mean with emmeans::emmeans()
I found that the marginal mean is calculated with the overall data and not the data per group.
Question: how can I obtain the estimated mean per group including confidence intervals from a GEE model in R?
Minimal reproducible example:
Data
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("emmeans")
#> Warning: package 'emmeans' was built under R version 3.5.2
library("geepack")
# Adding a grouping variable
pigs.group <- emmeans::pigs %>% mutate(group = c(rep("a", 20), rep("b", 9)))
Fitting the model
# Fitting the model
fit <- geepack::geeglm(conc ~ as.numeric(percent) + factor(group),
id = source, data = pigs.group)
# Model results
fit
#>
#> Call:
#> geepack::geeglm(formula = conc ~ as.numeric(percent) + factor(group),
#> data = pigs.group, id = source)
#>
#> Coefficients:
#> (Intercept) as.numeric(percent) factor(group)b
#> 20.498948 1.049322 10.703857
#>
#> Degrees of Freedom: 29 Total (i.e. Null); 26 Residual
#>
#> Scale Link: identity
#> Estimated Scale Parameters: [1] 36.67949
#>
#> Correlation: Structure = independence
#> Number of clusters: 3 Maximum cluster size: 10
Using emmeans::emmeans()
to calculate the marginal means and LCL/UCL. However, the group means for percent
are 12.9 in both groups. This is the overall observed mean of percent
and not the group mean.
# Calculating marginal means per group.
# Note that 'percent' is the same for both groups
emmeans::emmeans(fit, "percent", by = "group")
#> group = a:
#> percent emmean SE df asymp.LCL asymp.UCL
#> 12.9 34.1 3.252 Inf 27.7 40.4
#>
#> group = b:
#> percent emmean SE df asymp.LCL asymp.UCL
#> 12.9 44.8 0.327 Inf 44.1 45.4
#>
#> Covariance estimate used: vbeta
#> Confidence level used: 0.95
# Creating new data with acutal means per group
new.dat <- pigs.group %>%
group_by(group) %>%
summarise(percent = mean(percent))
# These are the actual group means
new.dat
#> # A tibble: 2 x 2
#> group percent
#> <chr> <dbl>
#> 1 a 13.2
#> 2 b 12.3
Predicting with predict
also returns other estiamted means per group, but no confidence intervals can be estimated for GEE in base R.
# Prediction with new data
# These should be the marginal means but how to get the confidence interval?
predict(fit, newdata = new.dat)
#> 1 2
#> 34.35000 44.14444
Created on 2019-02-08 by the reprex package (v0.2.1)
Upvotes: 1
Views: 1546
Reputation: 6760
What you thought was a computing issue turns out to be a statistical one...
When you have covariates in a model, the usual approach in post hoc analysis is to control for those covariates. In the context of the example given, we want to compare the mean response in different groups. However, the response is also affected by the covariate, percent
, and the mean percent differs with each group. If we just compute the marginal means for each group, those means differ in part because of the effects of percent
.
In an extreme example, imagine a situation where the group makes no difference whatsoever, but percent
does. Then if the mean percent
values differ enough among groups, then we could have statistically different means, but they would be different because of the effect of percent
, not because of the effect of group
.
For that reason, a "fair" comparison is obtained by predicting the means at the same percent -- say, the overall average percent in the dataset. That is the default method used in emmeans, and the results are termed adjusted means (look it up in a design textbook).
There is a situation where it is appropriate to use different percent values, and that is the case where percent is a "mediating variable;" that is, percent falls in the causal path between treatment and response, so that group
is believed to affect percent
as well as the response. See the vignette on messy data, in the subsection on mediating covariates.
If you truly think that percent
is a mediating covariate, then you can get separate percents like this:
emmeans(model, "group", cov.reduce = percent ~ group)
However, in a situation where percent
is regarded as being independent of group
, don't do this!
Upvotes: 4