gir emee
gir emee

Reputation: 71

Use emmeans with dummy variables

I have data which looks like this:

dat <- data.frame(ID = rep(1:4, each = 4),
                  score = c(0, 0, -3, -5, 0, -4, -4, -4, -1, -1, -2, -3, 0, 1, -2, -2),
                  visit1 = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0), 
                  visit2 = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0), 
                  visit4 = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
                  visit = c(v1, v1, v1, v1, v2, v2, v2, v2, v3, v3, v3, v3, v4, v4, v4, v4),
                  trt = c(A, A, A, A, A, A, A, A, P, P, P, P, P, P, P, P))

Now, I fitted a mixed model repeated measures using the mmrm package (https://cran.r-project.org/web/packages/mmrm/index.html) as follows:

fit <- mmrm(formula = score ~ visit1 + visit2 + visit4 + visit2:trt + visit4:trt + us(ID | visit), data = dat)

I am interested in the LS means for the treatment groups at the last visit (visit4/V4) and the contrast of this. With the following model fit2 <- mmrm(formula = score ~ visit + visit:trt + us(ID | visit), data = dat) I can simply do emmeans(m1, specs = pairwise ~ treat | visit). However, if I try to do this for the case where I use the 0/1 variables for visit (visit1, visit2, and visit4) I get an error: Error in h_df_1d_list(est = est, var = var, v_num = v_num, v_denom = v_denom) : Assertion on 'var' failed: Element 1 is not >= 2.22507e-308.

I have two questions about this all:

  1. What does the error message mean?
  2. How do I get the LS means and the contrast between the LS means for and between treatment groups at the last measurement (visit4) in case I want to use the model with the dummy variables for visit?

Upvotes: 1

Views: 500

Answers (3)

Russ Lenth
Russ Lenth

Reputation: 6810

This is a pretty long answer but maybe it will help pin down what are the issues with hard-coded dummy variables.

emmeans is designed in a way that you really should code all factors as class factor (or ordinal). For those factors that you want marginal means for, you have no choice about this. However, for those who insist on manually coding factors using dummy variables, emmeans will work with those, but not in the same way. Here, I’ll work through an illustrative example to show the difference. The following code simulates a somewhat unbalanced dataset with factors treat (3 levels) and dose (4 levels).

set.seed(24.0306)
dat = expand.grid(treat = factor(LETTERS[1:3]), dose = factor(0:3))
dat = rbind(dat[1:3, ], dat)   # more observations for dose 0
dat = dat[-sample(1:15, 2), ]  # "lose" a couple of observations
### frequencies of factor combinations
with(dat, table(treat, dose))

##      dose
## treat 0 1 2 3
##     A 2 0 1 1
##     B 2 1 1 1
##     C 1 1 1 1

teff = c(-4, 7, -3)      # true treat effects
deff = c(-6, -2, 3, 5)   # true dose effects
dat$y = apply(cbind(as.numeric(dat$treat), as.numeric(dat$dose)), 1, function(idx) {
    34 + teff[idx[1]] + deff[idx[2]] + .6 * rnorm(1)
})

Now we will fit two models, one with dose as a factor and one with dose manually coded with the same dummy variables:

dat = transform(dat,
            d0 = 0 + (dose == "0"),
            d1 = 0 + (dose == "1"),
            d2 = 0 + (dose == "2"),
            d3 = 0 + (dose == "3"))
options(contrasts = c("contr.treatment", "contr.poly"))
mod.fac = lm(y ~ treat + dose, dat)
mod.dum = lm(y ~ treat + d1 + d2 + d3, dat)

We used exactly the same coding, so the coefficients are the same:

coef(mod.fac)

## (Intercept)      treatB      treatC       dose1       dose2       dose3 
##  24.2031040  11.1261777   0.6237625   3.9153530   8.4476560  10.5616421

coef(mod.dum)

## (Intercept)      treatB      treatC          d1          d2          d3 
##  24.2031040  11.1261777   0.6237625   3.9153530   8.4476560  10.5616421

Now compare the EMMs

library(emmeans)
(emm.fac = emmeans(mod.fac, "treat"))

##  treat emmean    SE df lower.CL upper.CL
##  A       29.9 0.215  7     29.4     30.4
##  B       41.1 0.179  7     40.6     41.5
##  C       30.6 0.197  7     30.1     31.0
## 
## Results are averaged over the levels of: dose 
## Confidence level used: 0.95

(emm.dum = emmeans(mod.dum, "treat"))

##  treat emmean    SE df lower.CL upper.CL
##  A       35.7 0.316  7     34.9     36.4
##  B       46.8 0.273  7     46.1     47.4
##  C       36.3 0.263  7     35.7     36.9
## 
## Results are averaged over the levels of: d1, d2, d3 
## Confidence level used: 0.95

The EMMs are not the same. What gives? Well, I hope you remember that EMMs are marginal averages of a reference grid, so let’s look at those:

ref_grid(mod.fac)

## 'emmGrid' object with variables:
##     treat = A, B, C
##     dose = 0, 1, 2, 3

ref_grid(mod.dum)

## 'emmGrid' object with variables:
##     treat = A, B, C
##     d1 = 0, 1
##     d2 = 0, 1
##     d3 = 0, 1

Hmmmm. So each one of the dummy variables is by default regarded as a factor having two levels. This means that your manual dummy coding makes this look like a four-factor experiment instead of a two-factor one. That makes a difference!

Another way to see what’s happening is to obtain the linear functions actually being computed. This information is in the linfct slot of the object created by emmeans(). This slot comprises a matrix, and each row of the matrix contains the weights that are applied to each regression coefficient. Recall that the regression coefficients themselves are the same for the two models, so if we were to get the same linear functions, we’ll get the same EMMs. Here they are:

emm.fac@linfct

##      (Intercept) treatB treatC dose1 dose2 dose3
## [1,]           1      0      0  0.25  0.25  0.25
## [2,]           1      1      0  0.25  0.25  0.25
## [3,]           1      0      1  0.25  0.25  0.25

emm.dum@linfct

##      (Intercept) treatB treatC  d1  d2  d3
## [1,]           1      0      0 0.5 0.5 0.5
## [2,]           1      1      0 0.5 0.5 0.5
## [3,]           1      0      1 0.5 0.5 0.5

Interesting. So each dose coefficient gets twice the weight for mod.dum as for mod.dum, while both give the same weight to the intercept and treat coefficients. The dummy coding is going to produce wrong answers.

There’s a third interesting possibility. As we saw, emmeans by default views any covariate having just two different values as a factor. This is a convenience because often we want indicators to be viewed as factors. But we can force them to be treated as covariates instead:

ref_grid(mod.dum, cov.keep = "")

## 'emmGrid' object with variables:
##     treat = A, B, C
##     d1 = 0.15385
##     d2 = 0.23077
##     d3 = 0.23077

With this setup, the EMMs are not really averages anymore; they are predictions for the three treatments with the covariates d1, d2, and d3 set to the above values, leading to a third set of EMMs

(emm.dum2 = emmeans(mod.dum, "treat", cov.keep = ""))

##  treat emmean    SE df lower.CL upper.CL
##  A       29.2 0.203  7     28.7     29.7
##  B       40.3 0.177  7     39.9     40.7
##  C       29.8 0.200  7     29.3     30.3
## 
## Confidence level used: 0.95

emm.dum2@linfct

##      (Intercept) treatB treatC        d1        d2        d3
## [1,]           1      0      0 0.1538462 0.2307692 0.2307692
## [2,]           1      1      0 0.1538462 0.2307692 0.2307692
## [3,]           1      0      1 0.1538462 0.2307692 0.2307692

Now since the dummies are just 0s and 1s, the values of these in the reference grid are just the fractions of each that are equal to 1. So this suggests how we can get the above results from mod.fac:

emmeans(mod.fac, "treat", weights = "prop")

##  treat emmean    SE df lower.CL upper.CL
##  A       29.2 0.203  7     28.7     29.7
##  B       40.3 0.177  7     39.9     40.7
##  C       29.8 0.200  7     29.3     30.3
## 
## Results are averaged over the levels of: dose 
## Confidence level used: 0.95

That is, manually coded dummies, when treated as covariates, yield proportionally-weighted EMMs instead of the default equally-weighted ones. We can turn this around by modifying what covariate values to use with the dummy coding so that we get the equally-weighted EMMs:

emmeans(mod.dum, "treat", cov.keep = "", cov.reduce = function(x) 0.25)

##  treat emmean    SE df lower.CL upper.CL
##  A       29.9 0.215  7     29.4     30.4
##  B       41.1 0.179  7     40.6     41.5
##  C       30.6 0.197  7     30.1     31.0
## 
## Confidence level used: 0.95

Aha! We have come full circle, as this is the first set of EMMs we obtained from mod.fac.

Summary

All this shows that:

  • You may not use manual dummy coding for the factor for which you want estimated marginal means
  • For factors to be marginalized out, dummy coding can create false factors that yield incorrect EMMs
  • If you use cov.keep = "" with dummy coding, you will obtain proportionately-weighted EMMs, provided that your manual coding corresponds to contr.treatment coding.

All in all, life is really much simpler, in my opinion, if you avoid using manualy-created dummies.

Left as an exercise:

What happens if you use contr.sum dummies: e1 = d1 - d0, e2 = d2 - d0, e3 = d3 - d0? Hint: In a way, this is simpler because you don’t need to specify cov.keep, as these dummies have three different values and will be treated as covariates from the get-go.

Upvotes: 2

Daniel Sabanes Bove
Daniel Sabanes Bove

Reputation: 125

Thanks for using the mmrm package!

Feel free to post questions or problems also on our GitHub issues at https://github.com/openpharma/mmrm/issues. We will see the questions there much faster than here (I just stumbled across this post accidentally).

Regarding your question, I tried to make this into a full reproducible example, see below. (Note that in the model formula you most likely don't want the 0/1 dummy variables but instead the factor visit interaction with trt)

v1 <- "visit1"
v2 <- "visit2"
v3 <- "visit3"
v4 <- "visit4"

A <- "Active"
P <- "Placebo"

dat <- data.frame(ID = rep(1:4, each = 4),
                  score = c(0, 0, -3, -5, 0, -4, -4, -4, -1, -1, -2, -3, 0, 1, -2, -2),
                  visit1 = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0), 
                  visit2 = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0), 
                  visit4 = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
                  visit = c(v1, v1, v1, v1, v2, v2, v2, v2, v3, v3, v3, v3, v4, v4, v4, v4),
                  trt = c(A, A, A, A, A, A, A, A, P, P, P, P, P, P, P, P))

library(mmrm)
fit <- mmrm(formula = score ~ visit:trt + us(ID | visit), data = dat)

However, the problem that is detected by the most recent (0.2.2) mmrm package version is shown then with the following error message:

  time points have to be unique for each subject, detected following duplicates in data:
    visit ID
2  visit1  1
3  visit1  1
4  visit1  1
6  visit2  2
7  visit2  2
8  visit2  2
10 visit3  3
11 visit3  3
12 visit3  3
14 visit4  4
15 visit4  4
16 visit4  4

In general, the MMRM only makes sense (and is allowed by the package) when you have unique visits (time points) per individual. The rationale is that if you had duplicate time points (like twice visit 1) it is not clear how the correlation between these should be modelled.

So, bottom line, I think the mmrm fit itself is already erroneous here. In addition the emmeans comments from Russ can be helpful, thanks a lot!

Does this help?

Upvotes: 1

Russ Lenth
Russ Lenth

Reputation: 6810

This fails because there is no predictor named visit in your model.

emmeans() will not work with dummy variables as predictors. The design of emmeans() is to use a reference grid for all combinations of predictor values, and it makes no sense to make a prediction for visit1 = 1 and visit2 = 1, for example.

Re-fit your model with factors as predictors. Or with score as a multivariate response. I don't know what mmrm() requires, but if it doesn't support one of those possibilities, it should.

Upvotes: 1

Related Questions