Reputation: 21
For the data analysis of the project I'm working on, I have to perform latent class growth analysis (LCGA) to identify different trajectories of my outcome variable over time. Since I have a number of missing variables, I first have to use multiple imputation to construct complete datasets, which I can then use for the LCGA. I managed to figure out most of the problems/issues I ran into while learning how to perform the imputations and the LCGA, but I can't figure out how to run the LCGA on the imputed datasets and then pool the results to get the trajectories based on the pooled results.
Packages used:
library(tidyverse)
library(mice)
library(miceadds)
library(micemd)
library(parallel)
library(lcmm)
The code I have for the imputation is with dataset 'data':
set.seed(2555)
#create a predictormatrix to use in the mice function
pred <- quickpred(data, mincor = 0.5, minpuc = 0.3)
#parallel mice function to speed up computation, using a CPU with 32 logical cores and maxit=0 to keep computational times low as I test the code
imputed <- parlmice(data, n.core = 31, n.imp.core = 1, pred = pred, maxit=0, cluster.seed = 2555)
The problem I run into at this point is that the LCGA requires data in the long format to work. So at this point I have to convert the mids object created by the parlmice function to a dataframe. Then convert the dataframe to long format for LCGA and run the LCGA like this:
#create dataframe from mids object 'imputed'
lcmm <- complete(imputed, include = FALSE)
#convert dataframe 'lcmm' to long format based on the outcome variable measured at different time points
lcmmlong <- NA
lcmmlong <- lcmm %>% select(ID, x_1:x_14) %>% pivot_longer(
cols = c('x_1': 'x_14'),
names_to = "time",
names_prefix = "x_",
names_sep = NULL,
names_pattern = NULL,
names_ptypes = list(),
names_transform = list(),
names_repair = "check_unique",
values_to = "x",
values_drop_na = TRUE,
values_ptypes = list(),
values_transform = list()
)
#run lcga on the long formate dataframe
lcga1 <-hlme(x ~ time, random=~-1, subject = "ID", ng = 1, data = lcmmlong)
lcga2 <-gridsearch(rep = 100, maxiter = 10, minit = lcga1,hlme(x ~ time, random=~-1, subject = "ID", ng = 2, data = lcmmlong, mixture = ~ time))
lcga3 <-gridsearch(rep = 100, maxiter = 10, minit = lcga1,hlme(x ~ time, random=~-1, subject = "ID", ng = 3, data = lcmmlong, mixture = ~ time))
This ofcourse does not work, because the complete() function throws all imputed datasets in the same dataframe. I can do that in long or broad format, but either way the LCGA can't run on this dataframe. Solution would be to run the LCGA on every individual dataset from the imputation and then pool the results. I've found solutions for this for several different analyses, such as the example from the 'mice' package itself:
imp <- mice(nhanes, seed = 123, print = FALSE)
fit <- with(imp, lm(chl ~ age + bmi + hyp))
est1 <- pool(fit)
However, I have no idea how I can add the transformation from broad format to long format dataframe and the different lcga steps to the mice(), with() and pool() workflow. From what I could find, I would have to manually run the LCGA on every imputed database individually and then manually pool the results.
#Extract imputed databases one by one for databases 1 through n.
lcmmn <- complete(imputed, action = n, include = FALSE)
And even then, running the lcga is one thing, but more actions are required to determine the optimal number of trajectories and so on and so forth. So my question is: How can I run the LCGA on the imputed databases and pool the results, which I can then use to determine the optimal number of trajectories?
Upvotes: 2
Views: 903
Reputation: 1138
Let me first note that there is a second way of handling missing data using lavaan, i.e. the full information maximum likelihood (FIML, in the ?lavOptions
manual see the options tag for missing
). You can also include auxiliary variables with the wrapper function growth.auxiliary
from the R package semTools
.
I'm not entirely certain how this can be accomplished using out of the box methods in mice
or lavaan
.
lavaan.survey. There was supposed to be a wrapper package called lavaan.survey
. But lavaan
advanced and the last update of lavaan.survey
is from the year 2016. In recent approaches I reported a bug in the calculation of standard errors to the author of lavaan.survey
which has yet to be fixed.
manually. Doing the calculations manually is totally possible:
the pooling of the coefficients (for example given by lavaan::parameterEstimates
) is a fairly simple mean
aggregate accross all imputed data
as for the standard errors you need to include two terms; one for the sampling variance and one for the imputation variance
mean
of the standard errors across all imputed data(1+1/n) * sum_i^n(coef_i-mean(coef)) / (n-1)
BIFIEsurvey. There is, however, the R
package called BIFIEsurvey
which is designed to simplify such tasks. There even is a wrapper for lavaan
using multiply imputed data, (edit) BIFIE.lavaan.survey
(/edit) -- I didn't work with that to much and I think it can't handle the measurement parts of a SEM (edit: I contacted the maintainer of the package). But there is a possibility to define custom analysis functions. I illustrate using a custom example adapted from semtools
## linear growth model with a time-varying covariate
library(lavaan)
targetModel <- '
# intercept and slope with fixed coefficients
i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
# regressions
i ~ x1 + x2
s ~ x1 + x2
# time-varying covariates
t1 ~ c1
t2 ~ c2
t3 ~ c3
t4 ~ c4
'
fit <- growth(targetModel, data = Demo.growth)
summary(fit)
## using semTools (cf. ?growth.auxiliary)
library(semTools)
dat1 <- Demo.growth
dat1$z <- rnorm(nrow(dat1))
dat1$x1 <- ifelse(dat1$z < quantile(dat1$z, .3), NA, dat1$x1)
dat1$x2 <- ifelse(dat1$z > quantile(dat1$z, .8), NA, dat1$x2)
fitaux1 <- growth.auxiliary(targetModel, data = dat1, aux = "z",
missing = "fiml", estimator = "mlr")
summary(fitaux1)
## using BIFIEsurvey on multiply imputed data (cf. ?BIFIE.lavaan.survey)
imp <- mice(dat1, seed = 123, print = FALSE)
imp_list <- mice::complete(imp, action = "all", include = FALSE)
bdat <- BIFIE.data(imp_list)
bdat$dat1
userfct <- function(X, w) { #X <- bdat$dat1; w <- bdat$wgt;
X1 <- as.data.frame(X)
colnames(X1) <- bdat$varnames
fit <- growth(model = targetModel, data = X1)
lavaan_coef <- methods::getMethod("coef", "lavaan")
est <- lavaan_coef(fit)
return(c(est,
lavaan::fitMeasures(fit)[c("cfi","tli")]))
}
parnames <- names(userfct(bdat$dat1, bdat$wgt))
fitbsurv <- BIFIEsurvey::BIFIE.by(bdat, vars = bdat$varnames, userfct = userfct,
userparnames = parnames)
summary(fitbsurv, digits = 3)
Note that in the case of BIFIE.by
the estimated standard errors from lavaan
are not passed to BIFIEsurvey
. BIFIEsurvey
is designed to work with complex replication designs. If you do not have such a design, you could use BIFIE.data.jack
using a number of random jackknife groups (option jktype="JK_RANDOM"
).
Upvotes: 1