user17841309

Reputation: 21

Running Latent Class Growth Analysis on multiply imputed datasets

For the data analysis of the project I'm working on, I have to perform latent class growth analysis (LCGA) to identify different trajectories of my outcome variable over time. Since I have a number of missing values, I first have to use multiple imputation to construct complete datasets, which I can then use for the LCGA. I managed to work through most of the issues I ran into while learning how to perform the imputations and the LCGA, but I can't figure out how to run the LCGA on each imputed dataset and then pool the results so that the trajectories are based on the pooled results.

Packages used:

library(tidyverse)
library(mice)
library(miceadds)
library(micemd)
library(parallel)
library(lcmm)

The code I have for the imputation, using the dataset 'data', is:

set.seed(2555)
#create a predictormatrix to use in the mice function
pred <- quickpred(data, mincor = 0.5, minpuc = 0.3) 
#parallel mice function to speed up computation, using a CPU with 32 logical cores and maxit=0 to keep computational times low as I test the code
imputed <- parlmice(data, n.core = 31, n.imp.core = 1, pred = pred, maxit=0, cluster.seed = 2555) 

The problem I run into at this point is that the LCGA requires the data in long format. So I have to convert the mids object created by the parlmice function to a dataframe, convert that dataframe to long format, and then run the LCGA like this:

#create dataframe from mids object 'imputed'
lcmm <- complete(imputed, include = FALSE)

#convert dataframe 'lcmm' to long format based on the outcome variable measured at different time points
lcmmlong <- lcmm %>%
  select(ID, x_1:x_14) %>%
  pivot_longer(
    cols = x_1:x_14,
    names_to = "time",
    names_prefix = "x_",
    names_transform = list(time = as.integer), #time must be numeric for hlme
    values_to = "x",
    values_drop_na = TRUE
  )

#run LCGA on the long format dataframe
lcga1 <- hlme(x ~ time, random = ~-1, subject = "ID", ng = 1, data = lcmmlong)
lcga2 <- gridsearch(rep = 100, maxiter = 10, minit = lcga1, hlme(x ~ time, random = ~-1, subject = "ID", ng = 2, data = lcmmlong, mixture = ~ time))
lcga3 <- gridsearch(rep = 100, maxiter = 10, minit = lcga1, hlme(x ~ time, random = ~-1, subject = "ID", ng = 3, data = lcmmlong, mixture = ~ time))

This of course does not work, because the complete() function stacks all imputed datasets into the same dataframe. I can do that in long or broad format, but either way the LCGA can't run on that dataframe. The solution would be to run the LCGA on every individual dataset from the imputation and then pool the results. I've found solutions for this for several different analyses, such as the example from the 'mice' package itself:

imp <- mice(nhanes, seed = 123, print = FALSE)
fit <- with(imp, lm(chl ~ age + bmi + hyp))
est1 <- pool(fit)

However, I have no idea how to fit the conversion from broad to long format and the different LCGA steps into the mice(), with() and pool() workflow. From what I could find, I would have to manually run the LCGA on every imputed dataset individually and then manually pool the results, extracting the imputed datasets one by one:

#Extract the imputed datasets one by one, for dataset 1 through n
lcmmn <- complete(imputed, action = n, include = FALSE)
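
For illustration, I imagine the manual approach would look roughly like the loop below. This is an untested sketch: to_long() is just a placeholder helper wrapping the pivot from above, and it assumes hlme converges on every imputed dataset.

#list of all m completed datasets
imp_list <- mice::complete(imputed, action = "all", include = FALSE)

#placeholder helper: reshape one completed dataset to long format
to_long <- function(d) {
  d %>%
    select(ID, x_1:x_14) %>%
    pivot_longer(x_1:x_14, names_to = "time", names_prefix = "x_",
                 names_transform = list(time = as.integer), values_to = "x")
}

#fit the 1-class LCGA on each imputed dataset separately
fits_g1 <- lapply(imp_list, function(d) hlme(x ~ time, random = ~-1, subject = "ID",
                                             ng = 1, data = to_long(d)))
#...but I still don't know how to pool these fits into one set of trajectories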

And even then, running the LCGA is one thing, but more steps are needed to determine the optimal number of trajectories, and so on. So my question is: how can I run the LCGA on the imputed datasets and pool the results, so that I can then determine the optimal number of trajectories?

Upvotes: 2

Views: 903

Answers (1)

Tom

Reputation: 1138

Small excursion on FIML

Let me first note that there is a second way of handling missing data using lavaan, namely full information maximum likelihood (FIML; see the missing option described in ?lavOptions). You can also include auxiliary variables with the wrapper function growth.auxiliary from the R package semTools.
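
For illustration, here is a minimal sketch of what the FIML route looks like with plain lavaan; the model and the Demo.growth data are just placeholders from the lavaan documentation (Demo.growth itself has no missing values, so this only shows where the options go):

library(lavaan)

#linear growth model estimated with FIML, so incomplete rows are retained
fimlModel <- '
  i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
  s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
'
fit_fiml <- growth(fimlModel, data = Demo.growth, missing = "fiml", estimator = "mlr")
summary(fit_fiml, fit.measures = TRUE)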

Using lavaan on MI data

I'm not entirely certain how this can be accomplished using out-of-the-box methods in mice or lavaan.

lavaan.survey. There is a wrapper package called lavaan.survey, but lavaan has advanced and the last update of lavaan.survey is from 2016. Recently I reported a bug in the calculation of the standard errors to the author of lavaan.survey, which has yet to be fixed.

manually. Doing the calculations manually is entirely possible:

  • the pooling of the coefficients (for example given by lavaan::parameterEstimates) is a fairly simple mean aggregated across all imputed datasets

  • as for the standard errors, you need to combine two terms: one for the sampling variance and one for the imputation variance (a small pooling sketch follows this list)

    1. The sampling (within-imputation) variance is the mean of the squared standard errors, i.e. the variances, across all imputed datasets
    2. The imputation (between-imputation) variance is (1 + 1/n) * sum_i^n (coef_i - mean(coef))^2 / (n - 1); the pooled variance is the sum of the two terms, and the pooled standard error is its square root
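
A minimal sketch of that pooling, assuming 'fits' is a list of lavaan growth models, one fitted per imputed dataset (the object name is a placeholder; coef()/vcov() are used here instead of parameterEstimates() so that only the free parameters are pooled):

library(lavaan)

# 'fits': list of lavaan models, one per imputed dataset (placeholder)
est <- sapply(fits, coef)                      # matrix: free parameters x imputations
se  <- sapply(fits, function(f) sqrt(diag(vcov(f))))
m   <- length(fits)

pooled_est <- rowMeans(est)        # pooled coefficients: simple mean across imputations
within     <- rowMeans(se^2)       # within-imputation variance: mean of the variances
between    <- apply(est, 1, var)   # between-imputation variance: sum((b_i - mean)^2)/(m - 1)
pooled_se  <- sqrt(within + (1 + 1/m) * between)

round(cbind(pooled_est, pooled_se), 3)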

BIFIEsurvey. There is, however, an R package called BIFIEsurvey which is designed to simplify such tasks. There is even a wrapper for lavaan on multiply imputed data, (edit) BIFIE.lavaan.survey (/edit) -- I haven't worked with it much and I think it can't handle the measurement parts of a SEM (edit: I contacted the maintainer of the package). But there is the possibility to define custom analysis functions. I illustrate this with a custom example adapted from semTools:

## linear growth model with a time-varying covariate
library(lavaan)
targetModel <- '
  # intercept and slope with fixed coefficients
    i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
    s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4

  # regressions
    i ~ x1 + x2
    s ~ x1 + x2

  # time-varying covariates
    t1 ~ c1
    t2 ~ c2
    t3 ~ c3
    t4 ~ c4
'

fit <- growth(targetModel, data = Demo.growth)
summary(fit)

## using semTools (cf. ?growth.auxiliary)
library(semTools)
dat1 <- Demo.growth
dat1$z <- rnorm(nrow(dat1))
dat1$x1 <- ifelse(dat1$z < quantile(dat1$z, .3), NA, dat1$x1)
dat1$x2 <- ifelse(dat1$z > quantile(dat1$z, .8), NA, dat1$x2)

fitaux1 <- growth.auxiliary(targetModel, data = dat1, aux = "z",
                         missing = "fiml", estimator = "mlr")
summary(fitaux1)

## using BIFIEsurvey on multiply imputed data (cf. ?BIFIE.lavaan.survey)
library(mice)
library(BIFIEsurvey)

imp <- mice(dat1, seed = 123, print = FALSE)
imp_list <- mice::complete(imp, action = "all", include = FALSE)
bdat <- BIFIE.data(imp_list)
bdat$dat1

userfct <- function(X, w) { #X <- bdat$dat1; w <- bdat$wgt;
  X1 <- as.data.frame(X)
  colnames(X1) <- bdat$varnames
  
  fit <- growth(model = targetModel, data = X1)
  
  lavaan_coef <- methods::getMethod("coef", "lavaan")
  est <- lavaan_coef(fit)
  
  return(c(est,
           lavaan::fitMeasures(fit)[c("cfi","tli")]))
}
parnames <- names(userfct(bdat$dat1, bdat$wgt))
fitbsurv <-  BIFIEsurvey::BIFIE.by(bdat, vars = bdat$varnames, userfct = userfct,
                                   userparnames = parnames)

summary(fitbsurv, digits = 3)

Note that in the case of BIFIE.by the estimated standard errors from lavaan are not passed to BIFIEsurvey. BIFIEsurvey is designed to work with complex replication designs; if you do not have such a design, you could use BIFIE.data.jack with a number of random jackknife groups (option jktype = "JK_RANDOM"), as sketched below.
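
A rough sketch of that variant, reusing imp_list, userfct and parnames from above; the ngr value is my assumption for illustration (check ?BIFIE.data.jack for the exact arguments):

#build the BIFIE data object with random jackknife zones instead of a real replication design
bdat_jack <- BIFIE.data.jack(imp_list, jktype = "JK_RANDOM", ngr = 50)

#note: userfct above looks up bdat$varnames; that still works here because the variables are identical
fitbsurv_jack <- BIFIEsurvey::BIFIE.by(bdat_jack, vars = bdat_jack$varnames,
                                       userfct = userfct, userparnames = parnames)
summary(fitbsurv_jack, digits = 3)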

Upvotes: 1
