Reputation: 150
I am using the ContaminatedMixt package, CNmixt function for clustering multidimensional data. My data dimensions range from 3 to 9. Most of the data can be clustered with no issue, but I consistently struggle to cluster the 7D data.
Apologies for not being able to include a standalone reprex, but I know the error reliably occurs with my own data. You can download a .csv of the data via GoogleDrive.
To reproduce the error:
library(ContaminatedMixt)
df = read.csv (file = "210102_7clickcodas.csv",
header = TRUE,
fileEncoding = "UTF-8-BOM")
df = round(df, 3)
clusters = CNmixt (
df,
G = 2:10,
contamination = TRUE,
model = NULL,
initialization = "kmeans",
parallel = TRUE)
What I'm asking CNmixt to do is fit 2, then 3, then 4,..., then 10 clusters to the data and choose the number of clusters that is best (based on a criterion - I use ICL but BIC, AIC, etc. are also options). When I run this code, however, I always encounter the following error:
Error in checkForRemoteErrors(val) :
one node produced an error: BLAS/LAPACK routine 'DGEBAL' gave error code -3
Sometimes more than one node produces an error, but the error message is always the same. I've tried to troubleshoot which value of G is causing the issue by setting G to each value in the 2:10 range rather than the range itself (G=2, then G=3, then G=4, etc.). It looks like CNmixt can't divide the 7D data into 4 clusters. Doing so causes the error to throw and the program to stop running.
Most importantly, I want to: a) tell the CNmixt program to bypass any values of G that result in the error and to continue on. So if G=4 was the problem, the program would fit 2 clusters, 3 clusters, try to fit 4 and fail, and then move on to fitting 5 clusters. At the end of fitting 2:9 clusters it would choose the number of clusters that optimized the criterion and were fitted successfully. I think this would involve tweaking the CNmixt source code directly (the .CNmixtG2 function I think), but a way to do it without manipulating the CNmixt source code would be fantastic as well.
b) It would also be great to understand why this error occurs. I know that others have asked about this error before (for example, r msm BLAS/LAPACK routine 'DGEBAL' gave error code -3), and based on what I've read, the error code -3 seems to mean that there is an illegal value in the third INFO parameter. But it's proven to be really challenging for me to figure out where in the CNmixt code the 'DGEBAL' function is being called, what exactly it's doing, and why it's failing.
Any advice on how to do this would be appreciated!
Upvotes: 0
Views: 202
Reputation: 21937
This way might work for you. If you look at the R code for the CNmixt()
function it just repeatedly calls the CNmixt_main()
function, though this function isn't exported from the namespace. The first thing I did was to set up a search grid using expand.grid()
to get all combinations of the number of clusters and the models.
eg <- expand.grid(G=2:10,
model=c("EII", "VII", "EEI", "VEI", "EVI", "VVI",
"EEE", "VEE", "EVE", "EEV", "VVE", "VEV",
"EVV", "VVV"),
stringsAsFactors=FALSE)
Next, I initialize an output object (ICS
) and then loop over the rows of eg
. For each row (a combination of model and number of clusters), I estimate the model wrapped in try()
which makes it so that if a model throws an error it doesn't stop the loop. The object out would just contain the error and it is of class "try-error"
. If the function didn't terminate with an error, it returns the IC values otherwise, it returns a vector of NA
values that is as long as the vector of IC values that would be returned otherwise. Finally, add the IC values as a new row of ICS
which accumulates the values for each model.
ICS <- NULL
for(i in 1:nrow(eg)){
out <- try(ContaminatedMixt:::CNmixt_main (
df,
G = eg$G[i],
contamination = TRUE,
model = eg$model[i],
initialization = "kmeans",
alphafix = NULL,
alphamin = .5,
seed = NULL,
start.z=NULL,
start.v = NULL,
start = 0,
label = NULL,
AICcond = FALSE,
iter.max = 1000,
threshold = 1e-10,
parallel=FALSE,
eps=1e-100,
verbose=TRUE,
doCV = FALSE))
if(inherits(out, "try-error")){
ics <- rep(NA, 8)
}else{
ics <- unlist(out$models[[1]]$IC)
}
ICS <- rbind(ICS, ics)
}
Next, we could attach the IC values to the parameters over which we searched.
eg <- bind_cols(eg, as_tibble(ICS))
Next, we could find the best ICL value by cluster size. This is assuming that the best ICL is the smallest one, if it's the biggest one just replace min(ICL, na.rm=TRUE)
with max(ICL, na.rm=TRUE)
best_per_cluster <- eg %>%
group_by(G) %>%
filter(ICL == min(ICL, na.rm=TRUE)) %>%
arrange(G)
best_per_cluster
# # A tibble: 9 x 10
# # Groups: G [9]
# G model AIC BIC AIC3 CAIC AWE ICL AICc AICu
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 VII 28960. 28854. 28941. 28835. 28653. 28724. 28960. 28940.
# 2 3 EVI 31190. 30955. 31148. 30913. 30511. 30786. 31188. 31145.
# 3 4 VEI 34230. 33984. 34186. 33940. 33518. 33702. 34228. 34183.
# 4 5 EVI 35896. 35504. 35826. 35434. 34763. 35373. 35891. 35818.
# 5 6 EEI 34959. 34629. 34900. 34570. 34004. 34446. 34955. 34894.
# 6 7 EVI 38211. 37663. 38113. 37565. 36625. 37450. 38201. 38099.
# 7 8 EEI 39301. 38870. 39224. 38793. 38054. 38058. 39294. 39215.
# 8 9 EEI 37999. 37518. 37913. 37432. 36607. 37259. 37991. 37902.
# 9 10 EII 34709. 34205. 34619. 34115. 33252. 33900. 34700. 34607.
Finally, we can pick the best model, again assuming that the best ICL is the smallest one.
best_per_cluster %>%
ungroup %>%
filter(ICL == min(ICL))
# # A tibble: 1 x 10
# G model AIC BIC AIC3 CAIC AWE ICL AICc AICu
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2 VII 28960. 28854. 28941. 28835. 28653. 28724. 28960. 28940.
Obviously, the getting the best model per cluster size is an unnecessary step, but might be interesting to look at just the same.
Also, I've not answered your other question about why the error occurs. When I ran the CNmixt()
function, I didn't get the error until G=10, but when I ran the model using CNmixt_main()
I got the error earlier in the process (e.g., G=3 and G=4 for the EVE model). In my particular case, the error isn't quite as reliable as it is for you. My session info is below just for reference.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ContaminatedMixt_1.3.4.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.3
[4] gower_0.2.1 plyr_1.8.6 class_7.3-17
[7] iterators_1.0.12 tools_4.0.3 rpart_4.1-15
[10] ipred_0.9-9 mclust_5.4.6 lubridate_1.7.9
[13] mixture_1.5.1 lifecycle_0.2.0 tibble_3.0.3
[16] gtable_0.3.0 nlme_3.1-149 lattice_0.20-41
[19] pkgconfig_2.0.3 rlang_0.4.7 Matrix_1.2-18
[22] foreach_1.5.0 rstudioapi_0.11 parallel_4.0.3
[25] mvtnorm_1.1-1 prodlim_2019.11.13 stringr_1.4.0
[28] withr_2.2.0 dplyr_1.0.2 pROC_1.16.2
[31] generics_0.0.2 vctrs_0.3.4 recipes_0.1.12
[34] stats4_4.0.3 nnet_7.3-14 grid_4.0.3
[37] caret_6.0-86 tidyselect_1.1.0 data.table_1.13.0
[40] glue_1.4.2 R6_2.4.1 survival_3.2-7
[43] lava_1.6.7 reshape2_1.4.4 ggplot2_3.3.2
[46] purrr_0.3.4 magrittr_1.5 ModelMetrics_1.2.2.2
[49] splines_4.0.3 MASS_7.3-53 scales_1.1.1
[52] codetools_0.2-16 ellipsis_0.3.1 mnormt_2.0.1
[55] timeDate_3043.102 colorspace_1.4-1 stringi_1.5.3
[58] munsell_0.5.0 tmvnsim_1.0-2 crayon_1.3.4
Upvotes: 2