NiMa
NiMa

Reputation: 173

Caret doesn't run in parallel

Actual parallelizing caret depends on R , caret and doMC packages . As described at Parallelizing Caret code

Does anyone working with similar enviroment as I do ? What the max R version where R caret paralellization working correctly ?

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=C                  LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-52    ggplot2_1.0.1   lattice_0.20-31 doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   RStudioAMI_0.2 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1         magrittr_1.5        splines_3.2.1       MASS_7.3-41         munsell_0.4.2       colorspace_1.2-6   
 [7] minqa_1.2.4         car_2.1-0           stringr_1.0.0       plyr_1.8.3          tools_3.2.1         pbkrtest_0.4-2     
[13] nnet_7.3-9          grid_3.2.1          gtable_0.1.2        nlme_3.1-120        mgcv_1.8-6          quantreg_5.19      
[19] MatrixModels_0.4-1  gtools_3.5.0        lme4_1.1-9          digest_0.6.8        Matrix_1.2-0        nloptr_1.0.4       
[25] reshape2_1.4.1      codetools_0.2-11    stringi_0.5-5       BradleyTerry2_1.0-6 scales_0.3.0        stats4_3.2.1       
[31] SparseM_1.7         brglm_0.5-9         proto_0.3-10

Update 1 : My code follows :

library(doMC) ; registerDoMC(cores=4)
library(caret)
classification_formula <- as.formula(paste("target" ,"~",
                                             paste(names(m_input_data)[!names(m_input_data)=='target'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                             number = CVfolds,
                             repeats = CVreps ,
                             returnResamp = "final" ,
                             classProbs = T,
                             summaryFunction = twoClassSummary,
                             allowParallel = TRUE,verboseIter = TRUE)
 rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))
 rf <- train(classification_formula , data = m_input_data , method = "rf", metric="ROC" ,trControl = ma_control, tuneGrid = rf_tuneGrid , ntree = 101)

Update 2 : When I run from command line the only one core is working When I run these script from Rstudio the paralell is working since I see 4 processes via top . But a second after this the error happens :

  Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
   attempt to set an attribute on NULL 

Update 4 :

Hi , it seems the problem was in R session that was terminated . Each time I am start AWS instance I was run the R code with now refresh the R engine . Now each time I refresh Rstudio browser I do Session -> Restart R . Seems it runs . I am checking now if the same for run the script from Ubuntu command line.

Generally it is running without to finish . Caret parallel on the data level . It means it is able to process each resample on different process . But if sample still big ( 100,000 / 2 ( number of folds = 2) X 2,000 features ) this can be hard to finish for each processor unit . Am I right ?

I think the parallelism must on algorithm level . It means each algorithm run likely to run on several cores . If such algorithm imlpementation avialable in caret ???

Upvotes: 3

Views: 1716

Answers (1)

howaj
howaj

Reputation: 745

I have the latest release for Linux platforms, R version 3.2.2 (2015-08-14, Fire Safety), and paralellization works fine. Can you provide your code that does not work in parallel.

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kernlab_0.9-22  doMC_1.3.3      iterators_1.0.7 foreach_1.4.2   caret_6.0-52    ggplot2_1.0.1   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0         compiler_3.2.2      nloptr_1.0.4        plyr_1.8.3          tools_3.2.2         digest_0.6.8       
 [7] lme4_1.1-9          nlme_3.1-122        gtable_0.1.2        mgcv_1.8-7          Matrix_1.2-2        brglm_0.5-9        
[13] SparseM_1.7         proto_0.3-10        BradleyTerry2_1.0-6 stringr_1.0.0       gtools_3.5.0        MatrixModels_0.4-1 
[19] stats4_3.2.2        grid_3.2.2          nnet_7.3-10         minqa_1.2.4         reshape2_1.4.1      car_2.0-26         
[25] magrittr_1.5        scales_0.3.0        codetools_0.2-11    MASS_7.3-43         splines_3.2.2       pbkrtest_0.4-2     
[31] colorspace_1.2-6    quantreg_5.18       stringi_0.5-5       munsell_0.4.2      

I've used your code for the BreastCancer dataset on my local machine and it worked in parallel without any problem. I am using RStudio Version 0.98.1103.

library(caret)
library(mlbench)
data(BreastCancer)

library(doMC)  
registerDoMC(cores=2)

classification_formula <- as.formula(paste("Class" ,"~",
                                         paste(names(BreastCancer)[!names(BreastCancer)=='Class'],collapse="+")))

CVfolds <- 2
CVreps  <- 5
ma_control <- trainControl(method = "repeatedcv",
                           number = CVfolds,
                           repeats = CVreps ,
                           returnResamp = "final" ,
                           classProbs = T,
                           summaryFunction = twoClassSummary,
                           allowParallel = TRUE,verboseIter = TRUE)

rf_tuneGrid = expand.grid(mtry = seq(2,32, length.out = 6))

#Notice, it might be easier just to use Class~. 
#instead of classification_formula
rf <- train(classification_formula , 
            data = BreastCancer , 
            method = "rf", 
            metric="ROC" ,
            trControl = ma_control, 
            tuneGrid = rf_tuneGrid , 
            ntree = 101)

> rf
Random Forest 

699 samples
 10 predictors
  2 classes: 'benign', 'malignant' 

No pre-processing
Resampling: Cross-Validated (2 fold, repeated 5 times) 
Summary of sample sizes: 341, 342, 342, 341, 342, 341, ... 
Resampling results across tuning parameters:

 mtry  ROC        Sens       Spec       ROC SD       Sens SD      Spec SD    
   2    0.9867820  1.0000000  0.0000000  0.005007691  0.000000000  0.000000000
   8    0.9899107  0.9549550  0.9640196  0.002243649  0.006714919  0.017247716
  14    0.9907072  0.9558559  0.9631933  0.003028258  0.012345228  0.008019979
  20    0.9909514  0.9635135  0.9556513  0.003268291  0.006864342  0.010471005
  26    0.9911480  0.9630631  0.9539706  0.003384987  0.005113930  0.010628533
  32    0.9911485  0.9657658  0.9522969  0.002973508  0.004842197  0.004090206

ROC was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 32. 
> 

Upvotes: 1

Related Questions