Reputation: 11
I conducted latent class/cluster analysis in R using the package MCLUST. I have a revise-and-resubmit for my paper, and the reviewer suggested making a table of the fit indices for the cluster solutions (as of now I just report BIC in the text). When I look at a few papers that have used LCA approaches, they report BIC, sample-size-adjusted BIC, and entropy; of these, the only fit statistic MCLUST gives is BIC. I can find entropy plots but not the entropy statistic itself. It's a little late for me to re-run my analyses in Mplus (which I figured out was used for the LCA in those papers); frankly, it's a little late to re-run my analyses with any other clustering package. From all of my reading it sounds like MCLUST does things a tiny bit differently than some other k-means cluster approaches. Also, it seems that in some papers the model with the lowest BIC is selected, but in MCLUST the highest one is selected? Why?
So, tl;dr: what other model selection stats are reported in write-ups when using MCLUST? Is it standard/okay to report just BIC? How would I justify that?
Upvotes: 1
Views: 2670
Reputation: 464
Just a couple thoughts, having used mclust a bit previously.
1) mclust uses the correct BIC selection method; see this post:
https://stats.stackexchange.com/questions/237220/mclust-model-selection
See the very bottom, but to sum it up: whether you optimize for the lowest or the highest BIC depends on whether the negative sign is included in the formula.
The general definition of the BIC is BIC = −2 × ln(L(θ|x)) + k × ln(n); mclust does not include the negative component, i.e. it computes 2 × ln(L(θ|x)) − k × ln(n), so its BIC is the negative of the textbook one and the highest value indicates the best model.
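To make the sign convention concrete, here is a small base-R sketch with made-up log-likelihoods (toy numbers, not from any real fit): the same model wins under both conventions, it just shows up as the minimum in one and the maximum in the other.

```r
# Toy log-likelihoods for two hypothetical candidate models
loglik <- c(m1 = -520, m2 = -500)
k      <- c(m1 = 10,   m2 = 20)   # number of free parameters
n      <- 200                     # sample size

bic_textbook <- -2 * loglik + k * log(n)  # textbook sign: minimize
bic_mclust   <-  2 * loglik - k * log(n)  # mclust's sign: maximize

# One is just the negative of the other, so the same model is selected:
names(which.min(bic_textbook))  # winner under the textbook convention
names(which.max(bic_mclust))    # the same winner under mclust's convention
```

So when comparing a table of mclust BIC values against papers that minimized BIC, only the sign flips; the ranking of models is identical.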
2) mclust uses mixture models to perform the clustering (i.e., it is model-based). That is quite different from k-means, so I would be careful with phrasing like "a tiny bit different than some of the other k-means cluster approaches" (mainly in what "other" implies here). The process for model selection is briefly described in the mclust manual:
mclust provides a Gaussian mixture fitted to the data by maximum likelihood through the EM algorithm, for the model and number of components selected according to BIC. The corresponding components are hierarchically combined according to an entropy criterion, following the methodology described in the article cited in the references section. The solutions with numbers of classes between the one selected by BIC and one are returned as a clustCombi class object.
It's more useful to see the actual paper for a thorough explanation:
https://www.stat.washington.edu/raftery/Research/PDF/Baudry2010.pdf or here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2953822/
The entropy plot provided by mclust is meant to be read like a scree plot in factor analysis (i.e., you look for an elbow to determine the optimal number of classes). I would argue such plots are useful for justifying the choice of the number of clusters, and that they belong in the appendices.
mclust also returns the ICL statistic in addition to BIC, so you could report that as a compromise to the reviewer:
https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html (see the example on how to get it to output the statistics)
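A minimal sketch of pulling both statistics out, assuming mclust is installed; `iris` is just a stand-in dataset here, substitute your own data:

```r
library(mclust)

# Hypothetical stand-in data: replace with your own matrix/data frame
X <- iris[, 1:4]

fit <- Mclust(X)   # model and number of components selected by BIC
                   # (mclust's sign convention: higher BIC is better)

fit$bic            # BIC of the selected model
icl(fit)           # ICL of the selected model

# ICL across all models/numbers of components, for a side-by-side table
tab <- mclustICL(X)
summary(tab)
```

BIC and ICL often agree; when they disagree, ICL tends to favor fewer, better-separated clusters because it penalizes the entropy of the classification.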
3) If you want to create a table of the entPlot values, you can extract them like so (adapted from the ?entPlot example):
library(mclust)
data(Baudry_etal_2010_JCGS_examples)
# run clustCombi (which fits Mclust internally) to get the MclustOutput
output <- clustCombi(ex4.2, modelNames = "VII")
entPlot(output$MclustOutput$z, output$combiM, reg = c(2,3))
# legend: in red, the single-change-point piecewise linear regression;
# in blue, the two-change-point piecewise linear regression.
# added code to extract entropy values from the plot
combiM <- output$combiM
Kmax <- ncol(output$MclustOutput$z)
z0 <- output$MclustOutput$z
ent <- numeric()
for (K in Kmax:1) {
  z0 <- t(combiM[[K]] %*% t(z0))
  ent[K] <- -sum(mclust:::xlog(z0))
}
data.frame(`Number of clusters` = 1:Kmax, `Entropy` = round(ent, 3))
Number.of.clusters Entropy
1 1 0.000
2 2 0.000
3 3 0.079
4 4 0.890
5 5 6.361
6 6 20.158
7 7 35.336
8 8 158.008
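One caveat if you compare these numbers to the Mplus papers: the values above are raw total entropies, sum over i and k of −z_ik × log(z_ik), whereas Mplus reports a normalized "relative entropy" scaled to [0, 1], with values near 1 indicating clean class separation. A sketch of the standard rescaling (the names `output`, `ent`, and `Kmax` come from the snippet above; the formula E_K = 1 − total entropy / (n × log(K)), defined for K ≥ 2, is my addition, not mclust output):

```r
# Mplus-style relative entropy from the raw totals computed above
n <- nrow(output$MclustOutput$z)  # number of observations
K <- 2:Kmax                       # relative entropy is undefined at K = 1
rel_ent <- 1 - ent[K] / (n * log(K))
data.frame(`Number of clusters` = K, `Relative entropy` = round(rel_ent, 3))
```

That should give you a column you can place next to BIC (and ICL) in the fit-index table the reviewer asked for.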
Upvotes: 3