Max Herre

Reputation: 67

How to handle missing values (NA's) in a column in lmer

I would like to use na.pass as the na.action when working with lmer. Some observations in the data set have NA values in some columns. I just want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after controlling for the fixed effects. I think I have to work with na.action in lmer(). I am using the following model:

baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
    sex_dummy + region_dummy + political_position_dummy + (1|Family), data = baseline_df)

Error in qr.default(X, tol = tol, LAPACK = FALSE) : NA/NaN/Inf in foreign function call (arg 1)

My data: as you can see below, there are quite a lot of NAs in all the control variables, so "throwing away" all of these observations is not an option!

One example:

nat_dummy
1   : 335
2   :  19
NA's: 252

My questions:

1.) How can I include all of my control variables (expressed in multiple columns) to the model without kicking out observations (expressed in rows)?

2.) How does lmer handle the missing values in all the columns?

Upvotes: 3

Views: 3542

Answers (2)

Shawn Hemelstrand

Reputation: 3228

To answer your second question: lmer fits the model by (restricted) maximum likelihood, and with the default na.action (na.omit) it drops every row that has a missing value in the response or in any predictor. To avoid losing those rows, as others have suggested, you can use multiple imputation instead. I demonstrate below with the airquality dataset that ships with R, since your question doesn't include your data. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation, and broom.mixed for summarizing the results.

#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)

We can inspect the missing patterns with the next code:

#### Missing Patterns ####
md.pattern(airquality)

Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.

[Plot: missing-data patterns produced by md.pattern(airquality)]

To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m=5 part, but I specify it explicitly for your understanding).

#### Impute ####
imp <- mice(airquality,
            m=5)
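Since the question stresses that no rows may be lost, it can help to check one of the completed data sets directly. The following is a minimal sketch, not part of the original answer: the names imp_demo and completed1 are made up for illustration, the seed is arbitrary, and the block skips itself when mice is not installed.

```r
# Sketch: confirm that an imputed data set keeps every row and has no NAs.
# imp_demo / completed1 are illustrative names; seed = 123 is arbitrary.
if (requireNamespace("mice", quietly = TRUE)) {
  imp_demo <- mice::mice(airquality, m = 5, seed = 123, printFlag = FALSE)

  # Extract the first of the five completed data sets
  completed1 <- mice::complete(imp_demo, 1)

  # Same number of rows as the original data
  print(nrow(completed1) == nrow(airquality))

  # Are any missing values left after imputation?
  print(anyNA(completed1))
}
```

Each of the m completed data sets has the same shape as the original, which is why the later model fits use all observations.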

Afterwards, you fit the model to the imputations as shown below. The with() function takes your imputed data and fits the regression model to each imputed data set. This model is not a sensible one and comes back singular, but I use it because it's the quickest dataset I could remember that includes missing values.

#### Fit With Imputations ####
fit <- with(imp,
            lmer(Solar.R ~ Ozone + (1|Month)))

From there you can pool and summarize your results like so:

#### Pool and Summarise ####
pooled <- pool(fit)  # named "pooled" so the result doesn't mask the pool() function
summary(pooled)

Obviously, with the model being singular this is meaningless, but with a properly fitted model your summary should look something like this:

         term    estimate  std.error statistic       df     p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2       Ozone   0.8051218  0.2190679  3.675216 135.4051 0.000341446
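For intuition about what the pooling step does to a single coefficient, here is a minimal base-R sketch of Rubin's rules. The helper pool_rubin and the input numbers are made up for illustration. It covers only the point estimate and standard error and leaves out the degrees-of-freedom calculation, so it is not a substitute for pool().

```r
# Hypothetical helper: pool one coefficient across m imputed fits
# using Rubin's rules (point estimate and standard error only).
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)        # pooled point estimate
  u_bar <- mean(std_errors^2)     # average within-imputation variance
  b <- var(estimates)             # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b
  c(estimate = q_bar, std.error = sqrt(total_var))
}

# Made-up per-imputation estimates and standard errors, for illustration only
est <- c(0.80, 0.85, 0.78, 0.90, 0.82)
se <- c(0.20, 0.21, 0.19, 0.22, 0.20)
pool_rubin(est, se)
```

The between-imputation term is what makes the pooled standard error larger than that of any single imputed fit when the imputations disagree.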

As Ben already mentioned, you need to also determine why your data is missing. If there are non-random reasons for their missingness, this would require some consideration, as this can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:

https://www.gerkovink.com/miceVignettes/

Edit

You asked in the comments about adding in random effects estimates. I'm not sure why this isn't already ported into the respective packages, but the mitml package can help fill that gap. Here is the code:

#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
              extra.pars = T)

Which gives you both fixed and random effects for imputed lmer objects:

Call:

testEstimates(model = as.mitml.result(fit), extra.pars = T)

Final parameter estimates and inferences obtained from 5 imputed data sets.

             Estimate Std.Error   t.value        df   P(>|t|)       RIV       FMI 
(Intercept)   146.575    14.528    10.089    68.161     0.000     0.320     0.264 
Ozone           0.921     0.254     3.630    90.569     0.000     0.266     0.227 

                           Estimate 
Intercept~~Intercept|Month  112.587 
Residual~~Residual         7274.260 
ICC|Month                     0.015 

Unadjusted hypothesis test as appropriate in larger samples.

And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead:

                               Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual         7.274260e+03
ICC|Month                  1.522285e-02

Upvotes: 9

Ben Bolker

Reputation: 226097

Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
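To see the difference between the two na.action settings without fitting a mixed model, here is a small base-R sketch. It uses airquality as a stand-in for your data (an assumption), and only builds the model frame and design matrix that a fitting function would receive.

```r
# Base R only: compare the model frames built under na.omit and na.pass.
# airquality stands in for the asker's data.
form <- Solar.R ~ Ozone

mf_omit <- model.frame(form, data = airquality, na.action = na.omit)
mf_pass <- model.frame(form, data = airquality, na.action = na.pass)

nrow(airquality)  # all rows in the data
nrow(mf_omit)     # rows left after dropping any NA in Solar.R or Ozone
nrow(mf_pass)     # same as the data: rows with NA are kept

# Under na.pass the design matrix still contains NA values,
# which is what the numerical routines downstream cannot handle.
X <- model.matrix(form, mf_pass)
anyNA(X)
```

So na.pass keeps the row count only by handing the NA values on to the fitting code, not by dealing with them.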

For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
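As a concrete illustration of the simple single-imputation option, here is a base-R sketch of mean imputation on one column of airquality (used as a stand-in data set). The variable names aq and ozone_mean are made up for this example.

```r
# Sketch: single (mean) imputation of one column, base R only.
aq <- airquality
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)

# Replace every missing Ozone value with the observed mean
aq$Ozone[is.na(aq$Ozone)] <- ozone_mean

anyNA(aq$Ozone)                      # no missing values remain in this column
nrow(aq) == nrow(airquality)         # no rows were dropped

# The spread shrinks because every imputed value sits exactly at the mean
sd(airquality$Ozone, na.rm = TRUE)   # observed values only
sd(aq$Ozone)                         # after mean imputation
```

The artificially reduced spread is one reason this approach is acceptable for rough prediction but not for p-values or confidence intervals.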

mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.

There are a bunch of questions you should ask (and understand the answers to) before you embark on any kind of analysis with missing data:

  • what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
  • have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information (e.g. if some column(s) or row(s) have mostly missing information, imputation won't help very much)?
  • mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand what methods mice is using on your data, and that the defaults make sense for your case.
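A quick way to start on the second and third points is sketched below, using airquality as a stand-in for your data. The variable names are made up for illustration, and the last step runs only if mice is installed.

```r
# Sketch: explore how much is missing, and where (base R).
na_share <- colMeans(is.na(airquality))           # share of NAs per column
n_incomplete <- sum(!complete.cases(airquality))  # rows with at least one NA
na_per_row <- table(rowSums(is.na(airquality)))   # how many NAs each row has

na_share
n_incomplete
na_per_row

# Which imputation method would mice pick by default for each column?
# Columns without missing values get an empty string.
if (requireNamespace("mice", quietly = TRUE)) {
  print(mice::make.method(airquality))
}
```

Columns or rows that turn out to be mostly missing are candidates for dropping before imputation, as noted above.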

I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.

Upvotes: 3
