Ahir Bhairav Orai
Ahir Bhairav Orai

Reputation: 317

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.

 Id    y       Date          Soda    Team
 1    -0.4521  1999-02-07    Coke    Eagles 
 1     0.2863  1999-04-15    Pepsi   Raiders
 2     0.7956  1999-07-07    Coke    Raiders
 2    -0.8248  1999-07-26    NA      Raiders 
 3     0.8830  1999-05-29    Pepsi   Eagles 
 4     0.1303  2005-03-04    NA      Cowboys
 5     0.1375  2013-11-02    Coke    Cowboys
 5     0.2851  2015-06-23    Coke    Eagles 
 5    -0.3538  2015-07-29    Pepsi   NA 
 6     0.3349  2002-10-11    NA      NA
 7    -0.1756  2005-01-11    Pepsi   Eagles
 7     0.5507  2007-10-16    Pepsi   Cowboys
 7     0.5132  2012-07-13    NA      Cowboys
 7    -0.5776  2017-11-25    Coke    Cowboys
 8     0.5486  2009-02-08    Coke    Cowboys

I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variable. I am not sure how to setup this MI process using mice. I like some suggestions or advise from others who have encountered missing data in a repeated measure setting like this and how they used mice to tackle this problem. Thanks in advance.

Edit

This is what I have tried so far, but this does not capture the repeated measure part of the dataset.

library(mice)

init = mice(dat, maxit=0)

methd = init$method

predM = init$predictorMatrix

methd [c("Soda")]="logreg"; 
methd [c("Team")]="logreg";  

imputed = mice(data, method=methd , predictorMatrix=predM, m=5)

Upvotes: 3

Views: 2711

Answers (1)

Dion Groothof
Dion Groothof

Reputation: 1456

There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.

Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.

# install.packages('JM')
library(JM)

# note: use function(x) instead of \(x) if you use a version of R <4.1.0

# missing values per column
miss_abs <- \(x) sum(is.na(x)) 
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
             Number    %
ascites          60  3.1
hepatomegaly     61  3.1
spiders          58  3.0
serChol         821 42.2
alkaline         60  3.1
platelets        73  3.8

According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.

# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))

# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
  id    years alkaline serBilir
1  1  1.09517     1718     14.5
2  1  1.09517     1612     21.3
3  2 14.15234     7395      1.1
4  2 14.15234     2107      0.8
5  2 14.15234     1711      1.0

Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.

set.seed(1)

# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)

# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))

# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)

Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.

# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
   id   years.1 alkaline.1 serBilir.1   years.2 alkaline.2 serBilir.2   years.3 alkaline.3
1   1  1.095170       1718       14.5  1.938557       1612       21.3        NA         NA
3   2 14.152338       7395        1.1 15.198249       2107        0.8 15.943431       1711
12  3  2.770781        516        1.4  3.694434        353        1.1  5.148726        218
16  4  5.270507       6122        1.8  6.115197       1175        1.6  6.716832       1157

Now let's multiply the missing values in our covariates.

library(mice)

# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence

# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
     grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
     grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
     grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
     grep("1|2|3", colnames(pred), value = TRUE)] <- 0

# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0

# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs

As can be seen in the below three example traceplots (which can be obtained with plot(imp), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence. enter image description here

Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.

# need unlisted data
implong <- complete(imp, 'long', include = FALSE)

# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]

# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)

# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)

# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))

# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))

Finally, convert longlist back into a mids using datalist2mids from the miceadds package.

imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL

Upvotes: 2

Related Questions