Reputation: 477
I am using the lmer function from the lme4 package in R to fit linear mixed models. Is it okay to have NAs in both the response and the covariates (predictors)? I understand that linear mixed modelling excludes NAs and uses maximum likelihood estimation, but I am not sure whether NAs can be present in both the response and the predictor variables. If I don't exclude NAs my model runs fine, but I notice that the number of responses is uneven across the different time points. Does this matter?
E.g. at baseline (n = 1,980), at the 1-month time point (n = 1,841), etc.
As background, my data are patient data collected at 4 different time points (this is the response variable). A list of patient characteristics (covariates/predictor variables) is included in the model: BMI, age, presence of diabetes, blood pressure, radiation dose, etc. Some patient data weren't collected during follow-up, so there are missing values (1666) in the data frame.
Upvotes: 2
Views: 1832
Reputation: 226097
lme4, and most statistical models available in R, use complete case analysis, i.e. they automatically drop observations with NA values in the response or in any of the predictor variables.
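To make the complete-case behaviour concrete, here is a minimal sketch using the sleepstudy data that ships with lme4. The NAs are injected artificially for illustration, and the random-intercept model is an assumption chosen only to keep the example small; it is not the model from the question.

```r
library(lme4)

# sleepstudy: 180 rows (18 subjects x 10 days), no missing values originally
d <- sleepstudy
d$Reaction[c(3, 50)] <- NA   # artificial NAs in the response
d$Days[c(50, 120)]   <- NA   # artificial NAs in a predictor (row 50 lacks both)

# No explicit NA handling: lmer drops incomplete rows on its own
fit <- lmer(Reaction ~ Days + (1 | Subject), data = d)

n_total    <- nrow(d)                                               # rows supplied
n_complete <- sum(complete.cases(d[, c("Reaction", "Days", "Subject")]))
n_used     <- nobs(fit)                                             # rows actually fitted

c(total = n_total, complete = n_complete, used = n_used)
```

Comparing nobs(fit) with nrow() of your own data frame is a quick way to see how many rows were silently dropped.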
Dealing with missing data is a complex subject. The most typical approach, if the missing data are compromising your ability to analyze the data, is to do multiple imputation (e.g. using the mice package). This takes some effort, though. I would recommend Frank Harrell's Regression Modeling Strategies for a practical introduction to handling missing data in a biostatistical context.
In general, uneven sampling or lack of balance should not cause a problem for analyses with mixed models, unless the data are so unbalanced that some categories are completely missing.
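One way to check how unbalanced the data are is to tabulate the complete observations per time point and look for empty categories. This is a rough sketch in base R; the data frame, its column names (id, time, value) and the positions of the NAs are all made up for illustration.

```r
# Hypothetical long-format data: 6 patients, 4 planned time points
long <- expand.grid(
  id   = paste0("P", 1:6),
  time = c("t1", "t2", "t3", "t4"),
  stringsAsFactors = TRUE
)
long$value <- seq_len(nrow(long)) / 10
long$value[c(8, 15, 16, 22)] <- NA   # some follow-up measurements are missing

# Keep only rows the model would actually use
complete <- long[!is.na(long$value), ]

# Observations per time point (factor levels are kept, so empty ones show as 0)
n_per_time <- table(complete$time)

# Time points with no data at all would be the real problem
empty_levels <- names(n_per_time)[n_per_time == 0]

# Observations per patient, to spot patients with very few measurements
n_per_id <- table(complete$id)

n_per_time
```

Unequal counts per time point, as in the question, are expected here; only a count of zero would point to a completely missing category.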
In comments, you say that you've "read through a lot of sources which suggest that Linear Mixed models can handle NAs".
This is sort of true, but requires some explanation. In typical multivariate ANOVA contexts, the data are set up in wide format, i.e. one row for each group. For example:
id t1 t2 t3 t4 t5
A 1.0 2.0 3.0 2.1 7.2
B 1.1 1.9 2.4 2.3 1.4
...
In this format, missing or unbalanced data would appear as an NA; if we didn't have an observation for group A at time t5, that row of the data would be 1.0 2.0 3.0 2.1 NA. However, in mixed-model-world we usually represent the data in long format:
id time value
A t1 1.0
A t2 2.0
A t3 3.0
A t4 2.1
...
so in this case we wouldn't even include the missing observation in the first place (the default na.action setting, "na.omit", will automatically drop incomplete cases; setting na.action = na.exclude may be more convenient when making predictions etc.). In the MANOVA world we would have to decide how to deal with a group where any of the observations are missing; in the mixed-model world this corresponds to discarding a single group/time observation, corresponding to the complete-case analysis I described at the beginning.
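The wide-to-long conversion and the two na.action settings can be sketched in base R as follows. The numbers mirror the small table above, with group A's t5 value set to NA as in the hypothetical; the step column is an assumption added for illustration. lm() is used here only as a stand-in to keep the example dependency-free; lmer() accepts the same na.action argument.

```r
# Wide format: one row per group, with A missing at t5
wide <- data.frame(
  id = c("A", "B"),
  t1 = c(1.0, 1.1), t2 = c(2.0, 1.9), t3 = c(3.0, 2.4),
  t4 = c(2.1, 2.3), t5 = c(NA, 1.4)
)

# Long format: one row per group/time combination
time_cols <- c("t1", "t2", "t3", "t4", "t5")
long <- reshape(wide, direction = "long",
                varying = time_cols, v.names = "value",
                timevar = "time", times = time_cols, idvar = "id")

# Numeric version of the time label (hypothetical helper column)
long$step <- as.integer(sub("t", "", long$time))

# na.omit: the incomplete row is dropped and never reappears
fit_omit <- lm(value ~ step, data = long, na.action = na.omit)

# na.exclude: same fit, but residuals/fitted values are padded with NA
fit_exclude <- lm(value ~ step, data = long, na.action = na.exclude)

res_omit    <- residuals(fit_omit)      # one value per complete row
res_exclude <- residuals(fit_exclude)   # one value per original row
```

With na.exclude, the residuals line up with the rows of the original data frame, which makes it easier to attach them back to the data; the fitted coefficients are the same in both cases.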
Upvotes: 4