Reputation: 579
I am fitting random forest models with the caret and ranger packages and trying to speed up training with parallel processing. However, the speed gain is very small. I am using a MacBook Pro (Retina, 13-inch, Late 2013), 2.4 GHz Intel Core i5, 8 GB 1600 MHz DDR3, macOS Sierra 10.12. A reproducible example:
library(caret)
library(mlbench)
data("Sonar")

# sequential run: no parallel backend registered
start <- Sys.time()
mod_1 <- train(Class ~ ., data = Sonar, method = "ranger", num.trees = 10000)
stop <- Sys.time()
duration1 <- stop - start
duration1
This runs in 3.47 minutes. While it runs, Activity Monitor shows one R process with CPU usage around 300-330%. Now the parallel version:
library(parallel)
library(doParallel)

# register a PSOCK cluster, leaving one core free
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

start <- Sys.time()
mod_2 <- train(Class ~ ., data = Sonar, method = "ranger", num.trees = 10000)
stop <- Sys.time()
duration2 <- stop - start
duration2

# release the workers when done
stopCluster(cluster)
registerDoSEQ()
This runs in 3.06 minutes. While it runs, Activity Monitor shows 3 R processes, each with CPU usage around 100-120%. I also tested the doMC backend suggested in the caret documentation (http://topepo.github.io/caret/parallel-processing.html), which took 3.10 minutes (a sketch of that setup is below). This speed gain is much smaller than what the plots in the caret documentation led me to expect. Any ideas?
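For completeness, the doMC setup looks roughly like this (a minimal sketch based on the caret documentation; the core count of 3 is my assumption, matching detectCores() - 1 on this machine):

library(doMC)
# fork-based foreach backend (macOS/Linux); train() picks up whatever backend is registered
registerDoMC(cores = 3)
mod_3 <- train(Class ~ ., data = Sonar, method = "ranger", num.trees = 10000)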
sessionInfo:
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doParallel_1.0.10 ranger_0.6.0 e1071_1.6-7 doMC_1.3.4 iterators_1.0.8
[6] foreach_1.4.3 mlbench_2.1-1 caret_6.0-73 ggplot2_2.2.1 lattice_0.20-34
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 magrittr_1.5 splines_3.3.2 MASS_7.3-45 munsell_0.4.3
[6] colorspace_1.3-2 minqa_1.2.4 stringr_1.1.0 car_2.1-4 plyr_1.8.4
[11] tools_3.3.2 nnet_7.3-12 pbkrtest_0.4-6 grid_3.3.2 gtable_0.2.0
[16] nlme_3.1-130 mgcv_1.8-16 quantreg_5.29 class_7.3-14 MatrixModels_0.4-1
[21] lme4_1.1-12 lazyeval_0.2.0 assertthat_0.1 tibble_1.2 Matrix_1.2-8
[26] nloptr_1.0.4 reshape2_1.4.2 ModelMetrics_1.1.0 codetools_0.2-15 stringi_1.1.2
[31] compiler_3.3.2 scales_0.4.1 stats4_3.3.2 SparseM_1.74
Update: After the response by slonopotam, I tested the same models above with the randomForest package (version 4.6-12). Running sequentially (no parallel backend) took 8.14 minutes, during which Activity Monitor showed one R process at 95-100% CPU. Running in parallel took 3.72 minutes, during which there were 3 R processes, each at 95-100% CPU. A sketch of this comparison is below. Adding this info just for completeness. Thanks, slonopotam!
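For reference, the randomForest comparison looks roughly like this (a minimal sketch; it reuses the Sonar data and the cluster setup from above, and the ntree value mirrors num.trees in the earlier calls):

library(randomForest)

# sequential: randomForest is single-threaded, so only one core is busy
mod_rf_seq <- train(Class ~ ., data = Sonar, method = "rf", ntree = 10000)

# parallel: with a doParallel cluster registered, caret runs the
# resampling/tuning iterations on the worker processes
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
mod_rf_par <- train(Class ~ ., data = Sonar, method = "rf", ntree = 10000)
stopCluster(cluster)
registerDoSEQ()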
Upvotes: 1
Views: 317
Reputation: 1710
The 'ranger' package you are using has built-in multithreading support. That's why you are observing CPU usage around 300-330% in the first case: it is already using at least 3 cores for training.
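For reference, ranger's thread count can also be set explicitly through its num.threads argument; train() passes extra arguments such as num.trees on to ranger, so num.threads can be supplied the same way (a sketch; the value 4 is just an example for a 4-logical-core machine):

# let ranger's own threading use the machine, no foreach backend needed
mod_threads <- train(Class ~ ., data = Sonar, method = "ranger",
                     num.trees = 10000, num.threads = 4)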
When using doParallel, you are using multiprocessing instead of multithreading, but the total amount of computing resources used for training is nearly the same, so you do not see much gain.
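If you do want caret's resampling-level parallelism via doParallel, one option (a sketch, not from the question) is to restrict each worker to a single ranger thread so the two layers do not oversubscribe the cores:

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

# one ranger thread per foreach worker, so threads x workers roughly matches the core count
mod_single_thread <- train(Class ~ ., data = Sonar, method = "ranger",
                           num.trees = 10000, num.threads = 1)

stopCluster(cluster)
registerDoSEQ()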
Upvotes: 1