GLM and GEEGLM only work with smaller/specific dataset

Question

Unfortunately, I cannot offer a fully reproducible example because I cannot share the data. However, I hope someone can help me figure out the following.

Data
My dataset has 134 columns and 2521 rows. For the analysis I want to perform a GEE (geepack::geeglm) but the problem also occurs in a simple glm. The columns of interest in the model are:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   2521 obs. of  6 variables:
 $ SUBJID         : chr  "01" "01" "01" "01" ...
 $ util_trans     : num  0 0 0 0 0 0 0 0.431 0.225 0.139 ...
 $ base_utility   : num  0 0 0 0 0 0 0 0.431 0.431 0.431 ...
 $ trt_01         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 1 1 ...
 $ priorreg_factor: Factor w/ 2 levels "1",">1": 1 1 1 1 1 1 1 1 1 1 ...
 $ avisit_group   : Factor w/ 4 levels "baseline","treatment",..: 1 2 2 2 2 3 3 1 2 2 ...

The model
I provide the code for glm() since it is very similar to geepack::geeglm().

Fitting the model as follows returns an error:

glm(util_trans ~ I(base_utility) +
             factor(trt_01) +
             factor(priorreg_factor),
     data = na.omit(db),
     subset = avisit_group == "treatment")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

However, if I select only the necessary columns, the model runs perfectly fine:

glm(util_trans ~ I(base_utility) +
             factor(trt_01) +
             factor(priorreg_factor),
     data = na.omit(db %>% dplyr::select(SUBJID, util_trans, base_utility,
                                          trt_01, priorreg_factor, avisit_group)),
     subset = avisit_group == "treatment")

Call:  glm(formula = util_trans ~ I(base_utility) + factor(trt_01) + 
    factor(priorreg_factor), data = na.omit(db.eq5.seq %>% dplyr::select(SUBJID, 
    util_trans, base_utility, trt_01, priorreg_factor, avisit_group)), 
    subset = avisit_group == "treatment")

Coefficients:
              (Intercept)            I(base_utility)            factor(trt_01)1  factor(priorreg_factor)>1  
                0.09                  0.1                  0.02                  0.2  

Degrees of Freedom: 1118 Total (i.e. Null);  1115 Residual
Null Deviance:      32.89 
Residual Deviance: 22.47    AIC: -1187

Be aware that I changed the value of the coefficients by hand to 'anonymise' them.

Why is there a difference in the outcome although the data and the function call remain the same?

Dason · Accepted Answer

You are calling na.omit on the entire data frame. This omits every row that contains an NA in any column, including the columns that are not part of the model. It seems this drops enough rows that at least one of your factors is left with only a single level. Here is an example on a reduced scale to illustrate:

> dat <- data.frame(x = factor(c(1,1,1,2)), y = 1:4, unrelated = c(2,5,3,NA))
> dat
  x y unrelated
1 1 1         2
2 1 2         5
3 1 3         3
4 2 4        NA
> na.omit(dat)
  x y unrelated
1 1 1         2
2 1 2         5
3 1 3         3

> na.omit(dat[,c("x", "y")])
  x y
1 1 1
2 1 2
3 1 3
4 2 4

Notice that when we used na.omit with the unrelated variable included, it dropped the only row that had the level "2" for x. If we explicitly pick the columns we care about, that row stays in the data.
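
As a minimal sketch of how one might check for this and avoid it, the snippet below reuses the toy data from the example. The names `reduced`, `n_levels`, `form`, `kept` and `fit` are only illustrative, and the formula `y ~ x` stands in for the real model; this is one possible approach, not the only one.

```r
# Toy data mirroring the example above; `unrelated` stands in for the
# columns of the real data set that are not part of the model.
dat <- data.frame(x = factor(c(1, 1, 1, 2)),
                  y = 1:4,
                  unrelated = c(2, 5, 3, NA))

# Diagnose: count the levels actually present in each factor after na.omit().
# A count of 1 marks a factor that would trigger the contrasts error.
reduced <- na.omit(dat)
n_levels <- sapply(Filter(is.factor, reduced),
                   function(f) nlevels(droplevels(f)))
n_levels

# Fix: apply na.omit() only to the variables named in the formula.
# all.vars() extracts those variable names from the formula.
form <- y ~ x
kept <- na.omit(dat[, all.vars(form)])
fit <- glm(form, data = kept)
```

In the real data, any variable used outside the formula (for example the one in `subset`, or the cluster id passed to geeglm) would need to be added to the selected columns by hand, which is what the working call in the question does with dplyr::select.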
