Reputation: 67
I would like to use na.pass for na.action when working with lmer. There are NA values in some observations of the data set in some columns, and I just want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after controlling for the fixed effects.

I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
                           sex_dummy + region_dummy + political_position_dummy + (1|Family),
                         data = baseline_df)

which throws this error:

Error in qr.default(X, tol = tol, LAPACK = FALSE) : NA/NaN/Inf in foreign function call (arg 1)
My data: as you can see below, there are quite a lot of NAs in all of the control variables, so "throwing away" all of these observations is not an option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) in the model without kicking out observations (expressed in rows)?
2.) How does lmer handle the missing values in all the columns?
Upvotes: 3
Views: 3542
Reputation: 3228
To answer your second question: lmer typically uses maximum likelihood, where it will estimate missing values of the dependent variable and kick out missing values of your predictors. To avoid this, as others have suggested, you can use multiple imputation instead. I demonstrate an example below with the airquality dataset native to R, since you don't have your data included in your question. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation, and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missing-data patterns with the following code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
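If you prefer a plain numeric summary to the plot, a quick base-R check like the following (my addition, not part of the original answer) counts the NAs per column:

#### Count NAs Per Column ####
colSums(is.na(airquality)) # Ozone and Solar.R are the only columns with missing values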
To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).
#### Impute ####
imp <- mice(airquality,
m=5)
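A small aside (my addition, not part of the original answer): mice() uses random draws, so the imputed values change from run to run; its seed and printFlag arguments make the result reproducible and the console quieter:

#### Impute (reproducible) ####
imp <- mice(airquality,
            m = 5,
            seed = 123,        # fix the random number stream so results can be reproduced
            printFlag = FALSE) # suppress the iteration log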
Afterwards, you run the model on your imputations like below. The with() function takes your imputed data and runs the regression model on each imputation. This model is a bit erroneous and comes back singular, but I just use it because it's the quickest dataset I could remember that has missing values included.
#### Fit With Imputations ####
fit <- with(imp,
lmer(Solar.R ~ Ozone + (1|Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously with the model being singular this would be meaningless, but with a properly fitted model, your summary should look something like this:
term estimate std.error statistic df p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2 Ozone 0.8051218 0.2190679 3.675216 135.4051 0.000341446
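As a side note (my addition): broom.mixed, loaded earlier, is used behind the scenes so that pool()/summary() can tidy lmer fits. If you want to inspect a single imputed fit on its own, you can pull it out of the mira object, e.g.:

#### Inspect One Imputed Fit ####
tidy(fit$analyses[[1]], effects = "fixed") # fixed effects from the first imputed data set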
As Ben already mentioned, you also need to determine why your data is missing. If there are non-random reasons for the missingness, this will require some consideration, as it can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic: https://www.gerkovink.com/miceVignettes/
You asked in the comments about adding in random effects estimates. I'm not sure why this isn't something already ported into the respective packages, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
Estimate Std.Error t.value df P(>|t|) RIV FMI
(Intercept) 146.575 14.528 10.089 68.161 0.000 0.320 0.264
Ozone 0.921 0.254 3.630 90.569 0.000 0.266 0.227
Estimate
Intercept~~Intercept|Month 112.587
Residual~~Residual 7274.260
ICC|Month 0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which returns only the random-effects part:
Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual 7.274260e+03
ICC|Month 1.522285e-02
Upvotes: 9
Reputation: 226097
Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
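To make the single-imputation option above concrete, here is a minimal mean-imputation sketch (my illustration, not part of the original answer); it assumes your predictors are numeric dummy columns, as in your question, and is only defensible if you care about predictions rather than inference:

baseline_imp <- baseline_df                   # copy the data so the original stays intact
num_cols <- sapply(baseline_imp, is.numeric)  # which columns are numeric
baseline_imp[num_cols] <- lapply(baseline_imp[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)        # replace NAs with the column mean
  x
})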
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
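Transferred to your own variables, the basic mice + lmer workflow would look roughly like this (an untested sketch based on the formula in your question; the dummy-coded columns may need imputation methods other than the defaults):

library(mice)      # multiple imputation
library(lmerTest)  # lmer() with p-values
imp  <- mice(baseline_df, m = 5)   # impute the predictors (check the chosen methods!)
fits <- with(imp, lmer(log_life_time_income_child ~ nationality_dummy + sex_dummy +
                         region_dummy + political_position_dummy + (1 | Family)))
summary(pool(fits))                # pooled fixed-effect estimates across imputations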
There are a bunch of questions you should ask (and understand the answers to) before you embark on any kind of analysis with missing data. mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand what methods mice is using on your data, and that the defaults make sense for your case (a quick sketch of checking and overriding the chosen methods follows below).

I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get hold of a copy.
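For example, a quick way to see and override the methods mice picks (a sketch using the airquality example from the other answer):

meth <- make.method(airquality)  # default imputation method per column (e.g. "pmm" for numeric)
meth["Solar.R"] <- "norm"        # e.g. switch Solar.R to Bayesian linear regression
imp  <- mice(airquality, m = 5, method = meth, printFlag = FALSE)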
Upvotes: 3