Reputation: 850
Unfortunately, I cannot offer a completely reproducible example here because I cannot share the data. However, I hope someone can help me figure out the following.
Data
My dataset has 134 columns and 2521 rows. For the analysis I want to fit a GEE (geepack::geeglm), but the problem also occurs with a simple glm. The columns of interest in the model are:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2521 obs. of 6 variables:
$ SUBJID : chr "01" "01" "01" "01" ...
$ util_trans : num 0 0 0 0 0 0 0 0.431 0.225 0.139 ...
$ base_utility : num 0 0 0 0 0 0 0 0.431 0.431 0.431 ...
$ trt_01 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 1 1 ...
$ priorreg_factor: Factor w/ 2 levels "1",">1": 1 1 1 1 1 1 1 1 1 1 ...
$ avisit_group : Factor w/ 4 levels "baseline","treatment",..: 1 2 2 2 2 3 3 1 2 2 ...
The model
I provide the code for glm() since it is very similar to geepack::geeglm().
Fitting the model as follows returns an error:
glm(util_trans ~ I(base_utility) +
factor(trt_01) +
factor(priorreg_factor),
data = na.omit(db),
subset = avisit_group == "treatment")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
However, if I select only the necessary columns, the model runs perfectly fine:
glm(util_trans ~ I(base_utility) +
factor(trt_01) +
factor(priorreg_factor),
data = na.omit(db %>% dplyr::select(SUBJID, util_trans, base_utility,
trt_01, priorreg_factor, avisit_group)),
subset = avisit_group == "treatment")
Call: glm(formula = util_trans ~ I(base_utility) + factor(trt_01) +
factor(priorreg_factor), data = na.omit(db.eq5.seq %>% dplyr::select(SUBJID,
util_trans, base_utility, trt_01, priorreg_factor, avisit_group)),
subset = avisit_group == "treatment")
Coefficients:
(Intercept) I(base_utility) factor(trt_01)1 factor(priorreg_factor)>1
0.09 0.1 0.02 0.2
Degrees of Freedom: 1118 Total (i.e. Null); 1115 Residual
Null Deviance: 32.89
Residual Deviance: 22.47 AIC: -1187
Be aware that I changed the value of the coefficients by hand to 'anonymise' them.
Why is there a difference in the outcome although the data and the function call remain the same?
Upvotes: 0
Views: 207
Reputation: 61983
You are calling na.omit on the entire data frame. This drops every row that contains an NA in any column, including columns that are not part of the model. It seems this removes enough rows that at least one of your factors is left with only a single level. Here is a scaled-down example to illustrate:
> dat <- data.frame(x = factor(c(1,1,1,2)), y = 1:4, unrelated = c(2,5,3,NA))
> dat
x y unrelated
1 1 1 2
2 1 2 5
3 1 3 3
4 2 4 NA
> na.omit(dat)
x y unrelated
1 1 1 2
2 1 2 5
3 1 3 3
> na.omit(dat[,c("x", "y")])
x y
1 1 1
2 1 2
3 1 3
4 2 4
Notice that when na.omit was applied with the unrelated variable included, it dropped the only row that had the level "2" for x. If we explicitly select only the columns we care about, that row stays in the data.
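To check whether this is what happens in your data, you can count how many levels each factor keeps after na.omit. The sketch below uses made-up toy data, and n_levels_after is a hypothetical helper name, not part of any package. It also shows what is probably the simplest workaround: pass the unfiltered data to glm() and rely on its default na.action (na.omit under R's default options), which only considers the variables used in the model.

```r
# Toy data (invented for illustration); `unrelated` is not used in the model.
dat <- data.frame(
  x = factor(c(1, 1, 1, 2, 2, 1)),
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  unrelated = c(2, 5, 3, NA, NA, 1)
)

# Hypothetical helper: number of levels each factor retains once
# incomplete rows are dropped and unused levels are discarded.
n_levels_after <- function(df) {
  cleaned <- droplevels(na.omit(df))
  vapply(Filter(is.factor, cleaned), nlevels, integer(1))
}

n_levels_after(dat)                 # all columns considered
n_levels_after(dat[, c("x", "y")])  # only the model columns considered

# na.omit() on the full data frame leaves x with one observed level,
# so fitting fails; capture the error instead of stopping the script.
res <- tryCatch(glm(y ~ x, data = na.omit(dat)), error = function(e) e)
inherits(res, "error")

# Passing the raw data lets glm() drop rows with NAs in y or x only.
fit <- glm(y ~ x, data = dat)
nobs(fit)
```

If you need the same set of rows for both glm() and geeglm(), selecting the model columns before calling na.omit, as in your second call, has the same effect and makes the filtering explicit.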
Upvotes: 1