Frederick
Frederick

Reputation: 850

Marginal means and confidence levels per group with emmeans and geepack in R

Please consider the following:

When fitting a GEE with geepack we receive a model that we can predict with new values but base R does not support GEE models to calculate the confidence intervals. To obtain confidence intervals we can use emmeans::emmeans().

If the variables in the model are categorical and continuous I run into problems.

When estimating the marginal mean with emmeans::emmeans() I found that the marginal mean is calculated with the overall data and not the data per group.

Question: how can I obtain the estimated mean per group including confidence intervals from a GEE model in R?


Minimal reproducible example:

Data

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("emmeans")
#> Warning: package 'emmeans' was built under R version 3.5.2
library("geepack")

# Adding a grouping variable
pigs.group <- emmeans::pigs %>% mutate(group = c(rep("a", 20), rep("b", 9)))

Fitting the model

# Fitting the model
fit <- geepack::geeglm(conc ~ as.numeric(percent) + factor(group),
                       id = source, data = pigs.group)

# Model results
fit
#> 
#> Call:
#> geepack::geeglm(formula = conc ~ as.numeric(percent) + factor(group), 
#>     data = pigs.group, id = source)
#> 
#> Coefficients:
#>         (Intercept) as.numeric(percent)      factor(group)b 
#>           20.498948            1.049322           10.703857 
#> 
#> Degrees of Freedom: 29 Total (i.e. Null);  26 Residual
#> 
#> Scale Link:                   identity
#> Estimated Scale Parameters:  [1] 36.67949
#> 
#> Correlation:  Structure = independence  
#> Number of clusters:   3   Maximum cluster size: 10

Using emmeans::emmeans() to calculate the marginal means and LCL/UCL. However, the group means for percent are 12.9 in both groups. This is the overall observed mean of percent and not the group mean.

# Calculating marginal means per group.
# Note that 'percent' is the same for both groups
emmeans::emmeans(fit, "percent", by = "group")
#> group = a:
#>  percent emmean    SE  df asymp.LCL asymp.UCL
#>     12.9   34.1 3.252 Inf      27.7      40.4
#> 
#> group = b:
#>  percent emmean    SE  df asymp.LCL asymp.UCL
#>     12.9   44.8 0.327 Inf      44.1      45.4
#> 
#> Covariance estimate used: vbeta 
#> Confidence level used: 0.95

# Creating new data with acutal means per group
new.dat <- pigs.group %>%
        group_by(group) %>%
        summarise(percent = mean(percent))

# These are the actual group means
new.dat
#> # A tibble: 2 x 2
#>   group percent
#>   <chr>   <dbl>
#> 1 a        13.2
#> 2 b        12.3

Predicting with predict also returns other estiamted means per group, but no confidence intervals can be estimated for GEE in base R.

# Prediction with new data
# These should be the marginal means but how to get the confidence interval?
predict(fit, newdata = new.dat)
#>        1        2 
#> 34.35000 44.14444

Created on 2019-02-08 by the reprex package (v0.2.1)

Upvotes: 1

Views: 1546

Answers (1)

Russ Lenth
Russ Lenth

Reputation: 6760

What you thought was a computing issue turns out to be a statistical one...

When you have covariates in a model, the usual approach in post hoc analysis is to control for those covariates. In the context of the example given, we want to compare the mean response in different groups. However, the response is also affected by the covariate, percent, and the mean percent differs with each group. If we just compute the marginal means for each group, those means differ in part because of the effects of percent.

In an extreme example, imagine a situation where the group makes no difference whatsoever, but percent does. Then if the mean percent values differ enough among groups, then we could have statistically different means, but they would be different because of the effect of percent, not because of the effect of group.

For that reason, a "fair" comparison is obtained by predicting the means at the same percent -- say, the overall average percent in the dataset. That is the default method used in emmeans, and the results are termed adjusted means (look it up in a design textbook).

There is a situation where it is appropriate to use different percent values, and that is the case where percent is a "mediating variable;" that is, percent falls in the causal path between treatment and response, so that group is believed to affect percent as well as the response. See the vignette on messy data, in the subsection on mediating covariates.

If you truly think that percent is a mediating covariate, then you can get separate percents like this:

 emmeans(model, "group", cov.reduce = percent ~ group)

However, in a situation where percent is regarded as being independent of group, don't do this!

Upvotes: 4

Related Questions