TAH
TAH

Reputation: 150

Bypassing 'BLAS/LAPACK routine 'DGEBAL' gave error code -3' in CNmixt fn (ContaminatedMixt pkg)

I am using the ContaminatedMixt package, CNmixt function for clustering multidimensional data. My data dimensions range from 3 to 9. Most of the data can be clustered with no issue, but I consistently struggle to cluster the 7D data.

Apologies for not being able to include a standalone reprex, but I know the error reliably occurs with my own data. You can download a .csv of the data via GoogleDrive.

To reproduce the error:

library(ContaminatedMixt)
df = read.csv (file = "210102_7clickcodas.csv",
                     header = TRUE,
                     fileEncoding =  "UTF-8-BOM")
df = round(df, 3) 
clusters = CNmixt (
  df,
  G = 2:10,
  contamination = TRUE,
  model = NULL,
  initialization = "kmeans",
  parallel = TRUE)

What I'm asking CNmixt to do is fit 2, then 3, then 4,..., then 10 clusters to the data and choose the number of clusters that is best (based on a criterion - I use ICL but BIC, AIC, etc. are also options). When I run this code, however, I always encounter the following error:

Error in checkForRemoteErrors(val) : 
  one node produced an error: BLAS/LAPACK routine 'DGEBAL' gave error code -3

Sometimes more than one node produces an error, but the error message is always the same. I've tried to troubleshoot which value of G is causing the issue by setting G to each value in the 2:10 range rather than the range itself (G=2, then G=3, then G=4, etc.). It looks like CNmixt can't divide the 7D data into 4 clusters. Doing so causes the error to throw and the program to stop running.

Most importantly, I want to: a) tell the CNmixt program to bypass any values of G that result in the error and to continue on. So if G=4 was the problem, the program would fit 2 clusters, 3 clusters, try to fit 4 and fail, and then move on to fitting 5 clusters. At the end of fitting 2:9 clusters it would choose the number of clusters that optimized the criterion and were fitted successfully. I think this would involve tweaking the CNmixt source code directly (the .CNmixtG2 function I think), but a way to do it without manipulating the CNmixt source code would be fantastic as well.

b) It would also be great to understand why this error occurs. I know that others have asked about this error before (for example, r msm BLAS/LAPACK routine 'DGEBAL' gave error code -3), and based on what I've read, the error code -3 seems to mean that there is an illegal value in the third INFO parameter. But it's proven to be really challenging for me to figure out where in the CNmixt code the 'DGEBAL' function is being called, what exactly it's doing, and why it's failing.

Any advice on how to do this would be appreciated!

Upvotes: 0

Views: 202

Answers (1)

DaveArmstrong
DaveArmstrong

Reputation: 21937

This way might work for you. If you look at the R code for the CNmixt() function it just repeatedly calls the CNmixt_main() function, though this function isn't exported from the namespace. The first thing I did was to set up a search grid using expand.grid() to get all combinations of the number of clusters and the models.

eg <- expand.grid(G=2:10, 
            model=c("EII", "VII", "EEI", "VEI", "EVI", "VVI", 
                    "EEE", "VEE", "EVE", "EEV", "VVE", "VEV", 
                    "EVV", "VVV"), 
            stringsAsFactors=FALSE)

Next, I initialize an output object (ICS) and then loop over the rows of eg. For each row (a combination of model and number of clusters), I estimate the model wrapped in try() which makes it so that if a model throws an error it doesn't stop the loop. The object out would just contain the error and it is of class "try-error". If the function didn't terminate with an error, it returns the IC values otherwise, it returns a vector of NA values that is as long as the vector of IC values that would be returned otherwise. Finally, add the IC values as a new row of ICS which accumulates the values for each model.

ICS <- NULL
for(i in 1:nrow(eg)){
  out <- try(ContaminatedMixt:::CNmixt_main (
    df,
    G = eg$G[i],
    contamination = TRUE,
    model = eg$model[i],
    initialization = "kmeans",
    alphafix = NULL, 
    alphamin = .5, 
    seed = NULL, 
    start.z=NULL, 
    start.v = NULL, 
    start = 0, 
    label = NULL, 
    AICcond = FALSE, 
    iter.max = 1000, 
    threshold = 1e-10, 
    parallel=FALSE, 
    eps=1e-100, 
    verbose=TRUE, 
    doCV = FALSE))
    if(inherits(out, "try-error")){
      ics <- rep(NA, 8)
    }else{
      ics <- unlist(out$models[[1]]$IC)
    }
  ICS <- rbind(ICS, ics)
}

Next, we could attach the IC values to the parameters over which we searched.

eg <- bind_cols(eg, as_tibble(ICS))

Next, we could find the best ICL value by cluster size. This is assuming that the best ICL is the smallest one, if it's the biggest one just replace min(ICL, na.rm=TRUE) with max(ICL, na.rm=TRUE)

best_per_cluster <- eg %>% 
  group_by(G) %>% 
  filter(ICL == min(ICL, na.rm=TRUE)) %>% 
  arrange(G)
best_per_cluster 
# # A tibble: 9 x 10
# # Groups:   G [9]
#      G model    AIC    BIC   AIC3   CAIC    AWE    ICL   AICc   AICu
#   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1     2 VII   28960. 28854. 28941. 28835. 28653. 28724. 28960. 28940.
# 2     3 EVI   31190. 30955. 31148. 30913. 30511. 30786. 31188. 31145.
# 3     4 VEI   34230. 33984. 34186. 33940. 33518. 33702. 34228. 34183.
# 4     5 EVI   35896. 35504. 35826. 35434. 34763. 35373. 35891. 35818.
# 5     6 EEI   34959. 34629. 34900. 34570. 34004. 34446. 34955. 34894.
# 6     7 EVI   38211. 37663. 38113. 37565. 36625. 37450. 38201. 38099.
# 7     8 EEI   39301. 38870. 39224. 38793. 38054. 38058. 39294. 39215.
# 8     9 EEI   37999. 37518. 37913. 37432. 36607. 37259. 37991. 37902.
# 9    10 EII   34709. 34205. 34619. 34115. 33252. 33900. 34700. 34607.

Finally, we can pick the best model, again assuming that the best ICL is the smallest one.

best_per_cluster %>% 
  ungroup %>% 
  filter(ICL == min(ICL))
# # A tibble: 1 x 10
#         G model    AIC    BIC   AIC3   CAIC    AWE    ICL   AICc   AICu
#     <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#   1     2 VII   28960. 28854. 28941. 28835. 28653. 28724. 28960. 28940.

Obviously, the getting the best model per cluster size is an unnecessary step, but might be interesting to look at just the same.

Also, I've not answered your other question about why the error occurs. When I ran the CNmixt() function, I didn't get the error until G=10, but when I ran the model using CNmixt_main() I got the error earlier in the process (e.g., G=3 and G=4 for the EVE model). In my particular case, the error isn't quite as reliable as it is for you. My session info is below just for reference.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ContaminatedMixt_1.3.4.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           pillar_1.4.6         compiler_4.0.3      
 [4] gower_0.2.1          plyr_1.8.6           class_7.3-17        
 [7] iterators_1.0.12     tools_4.0.3          rpart_4.1-15        
[10] ipred_0.9-9          mclust_5.4.6         lubridate_1.7.9     
[13] mixture_1.5.1        lifecycle_0.2.0      tibble_3.0.3        
[16] gtable_0.3.0         nlme_3.1-149         lattice_0.20-41     
[19] pkgconfig_2.0.3      rlang_0.4.7          Matrix_1.2-18       
[22] foreach_1.5.0        rstudioapi_0.11      parallel_4.0.3      
[25] mvtnorm_1.1-1        prodlim_2019.11.13   stringr_1.4.0       
[28] withr_2.2.0          dplyr_1.0.2          pROC_1.16.2         
[31] generics_0.0.2       vctrs_0.3.4          recipes_0.1.12      
[34] stats4_4.0.3         nnet_7.3-14          grid_4.0.3          
[37] caret_6.0-86         tidyselect_1.1.0     data.table_1.13.0   
[40] glue_1.4.2           R6_2.4.1             survival_3.2-7      
[43] lava_1.6.7           reshape2_1.4.4       ggplot2_3.3.2       
[46] purrr_0.3.4          magrittr_1.5         ModelMetrics_1.2.2.2
[49] splines_4.0.3        MASS_7.3-53          scales_1.1.1        
[52] codetools_0.2-16     ellipsis_0.3.1       mnormt_2.0.1        
[55] timeDate_3043.102    colorspace_1.4-1     stringi_1.5.3       
[58] munsell_0.5.0        tmvnsim_1.0-2        crayon_1.3.4        

Upvotes: 2

Related Questions