How to handle missing values (NA's) in a column in lmer

Question

I would like to use na.pass as the na.action when working with lmer. Some observations in the data set have NA values in some columns. I only want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after these controls are added as fixed effects. I think I have to work with na.action in lmer(). I am using the following model:

baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
    sex_dummy + region_dummy + political_position_dummy + (1|Family), data = baseline_df)

Error in qr.default(X, tol = tol, LAPACK = FALSE) : NA/NaN/Inf in foreign function call (arg 1)

My data: as you can see below, there are quite a lot of NAs in all the control variables, so throwing away all of these observations is not an option!

One example:

nat_dummy
1   : 335
2   :  19
NA's: 252

My questions:

1.) How can I include all of my control variables (spread over multiple columns) in the model without kicking out observations (rows)?

2.) How does lmer handle missing values in the columns?

Shawn Hemelstrand · Accepted Answer

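To answer your second question: lmer fits by (restricted) maximum likelihood, and with the default na.action (na.omit) it drops every row that has an NA in any variable used in the model, whether that is the outcome or a predictor. It does not fill in missing predictor values for you, and na.action = na.pass does not help either, because it hands the NAs straight to the fitting code, which is most likely where the qr.default error in your question comes from. To keep all of your rows, as others have suggested, you can use multiple imputation instead. Below I demonstrate with the airquality dataset that ships with R, since your question doesn't include your data. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation, and broom.mixed for summarizing the results.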

#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
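Before imputing, it can help to see how many rows the default handling would cost. The sketch below uses only base R and assumes that lmer prepares its data the usual way, through a model frame with na.action = na.omit. The formula mirrors the example model used further down, with Month written as a plain column because model.frame() does not understand the (1|Month) syntax.

```r
# Rows available in the built-in airquality data
n_total <- nrow(airquality)

# Default behaviour: any row with an NA in a model variable is dropped
mf_omit <- model.frame(Solar.R ~ Ozone + Month, data = airquality,
                       na.action = na.omit)
n_used <- nrow(mf_omit)

# na.pass keeps every row, but the NAs are still sitting in the data
mf_pass <- model.frame(Solar.R ~ Ozone + Month, data = airquality,
                       na.action = na.pass)

c(total = n_total, used = n_used, dropped = n_total - n_used)
anyNA(mf_pass)
```

On airquality this keeps 111 of the 153 rows, so roughly a quarter of the data would be lost to listwise deletion.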

We can inspect the missing patterns with the next code:

#### Missing Patterns ####
md.pattern(airquality)

Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
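If you prefer numbers to the plot, roughly the same information can be tabulated in base R. This is only a sketch of one way to do it, not what md.pattern() does internally; the object names are my own.

```r
# Number of NAs in each column of airquality
na_per_column <- colSums(is.na(airquality))

# Cross-tabulate missingness of the two columns that contain NAs
miss <- is.na(airquality[, c("Ozone", "Solar.R")])
pattern_counts <- table(Ozone_missing   = miss[, "Ozone"],
                        Solar.R_missing = miss[, "Solar.R"])

na_per_column
pattern_counts
```

The TRUE/TRUE cell of the table is the count of rows missing both Ozone and Solar.R.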

To fill in the gaps, we can impute the data with 5 imputations (this is the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).

#### Impute ####
imp <- mice(airquality,
            m=5)
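Since the size of the data set matters to you, here is a small sketch of how you might check that an imputed data set keeps every row. The seed value and the object names imp_seeded and first_completed are arbitrary choices of mine; seed and printFlag are arguments of mice(), and complete() extracts one filled-in data set.

```r
library(mice)

# Same imputation as above, but reproducible and without the iteration log
imp_seeded <- mice(airquality, m = 5, seed = 2024, printFlag = FALSE)

# Pull out the first of the five completed data sets
first_completed <- complete(imp_seeded, 1)

nrow(first_completed)   # same number of rows as airquality
anyNA(first_completed)  # no NAs left after imputation
```

Each of the five completed data sets has the full number of rows, which is why the models fitted on them use all observations.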

Afterwards, you fit the model to the imputations like below. The with() function takes your imputed data and fits the regression model to each imputed data set. This model is a bit erroneous and comes back singular, but I just use it because it's the quickest dataset I could remember with missing values included.

#### Fit With Imputations ####
fit <- with(imp,
            lmer(Solar.R ~ Ozone + (1|Month)))

From there you can pool and summarize your results like so:

#### Pool and Summarise ####
pooled <- pool(fit)
summary(pooled)

Obviously, with the model being singular this would be meaningless, but with a properly fitted model your summary should look something like this:

         term    estimate  std.error statistic       df     p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2       Ozone   0.8051218  0.2190679  3.675216 135.4051 0.000341446
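The estimate and std.error columns in a pooled summary like this come from Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. Here is a worked sketch in base R; the five estimates and standard errors are made-up numbers for illustration, not output from the model above.

```r
# Hypothetical slope estimates and standard errors from m = 5 imputed fits
est <- c(0.80, 0.85, 0.78, 0.83, 0.79)
se  <- c(0.21, 0.22, 0.20, 0.21, 0.22)
m   <- length(est)

pooled_est  <- mean(est)    # pooled point estimate
within_var  <- mean(se^2)   # average sampling variance within imputations
between_var <- var(est)     # variance of the estimates across imputations

# Rubin's rules: total variance combines both sources of uncertainty
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, std.error = pooled_se)
```

The pooled standard error is always at least as large as the average within-imputation one, which is how the extra uncertainty from the missing values shows up in your inference.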

As Ben already mentioned, you need to also determine why your data is missing. If there are non-random reasons for their missingness, this would require some consideration, as this can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:

https://www.gerkovink.com/miceVignettes/
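As a rough first look at why values are missing, you can model the missingness itself. The sketch below (my own choice of predictors and names, base R only) regresses an indicator for missing Ozone on the fully observed columns. Clear effects would suggest the data are not missing completely at random, although a check like this cannot rule out missingness that depends on the unobserved values themselves.

```r
# Work on a copy so the built-in data set is left untouched
aq <- airquality
aq$ozone_missing <- is.na(aq$Ozone)

# Logistic regression: is missing Ozone related to observed variables?
miss_model <- glm(ozone_missing ~ Temp + Wind + factor(Month),
                  family = binomial, data = aq)

summary(miss_model)$coefficients
```

Variables that predict missingness are good candidates to keep in the imputation model.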

Edit

You asked in the comments about adding in random effects estimates. I'm not sure why this isn't already built into the respective packages, but the mitml package can help fill that gap. Here is the code:

#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
              extra.pars = TRUE)

Which gives you both fixed and random effects for imputed lmer objects:

Call:

testEstimates(model = as.mitml.result(fit), extra.pars = TRUE)

Final parameter estimates and inferences obtained from 5 imputed data sets.

             Estimate Std.Error   t.value        df   P(>|t|)       RIV       FMI 
(Intercept)   146.575    14.528    10.089    68.161     0.000     0.320     0.264 
Ozone           0.921     0.254     3.630    90.569     0.000     0.266     0.227 

                           Estimate 
Intercept~~Intercept|Month  112.587 
Residual~~Residual         7274.260 
ICC|Month                     0.015 

Unadjusted hypothesis test as appropriate in larger samples.

And if you just want to pull the random effects, append $extra.pars to the same call, i.e. testEstimates(as.mitml.result(fit), extra.pars = TRUE)$extra.pars, which returns only that part:

                               Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual         7.274260e+03
ICC|Month                  1.522285e-02
