user1631306

Reputation: 4470

parallel execution of random forest in R

I am running random forest in R in parallel

library(doMC)
library(randomForest)
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)

Parallel execution (took 73 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar%
randomForest(x, y, ntree=ntree) 

Sequential execution (took 82 sec)

rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do%
randomForest(x, y, ntree=ntree) 

In the parallel execution, the tree generation is pretty quick (about 3-7 sec), but the rest of the time is consumed combining the results (the combine option). So running in parallel is only worthwhile when the number of trees is really high. Is there any way I can tweak the "combine" option to skip calculations at each node that I don't need, and make it faster?

PS. The above is just example data. In reality I have roughly 100 thousand features for about 100 observations.
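If you only need class predictions rather than a single merged forest object, one workaround is to skip combine entirely and sum the raw vote counts across workers. This is a sketch, not an official randomForest recipe; the norm.votes handling is an assumption worth checking against your version of the package:

```r
# Sketch: avoid randomForest::combine by aggregating vote counts instead.
# Assumes you only need class predictions, not a merged forest object.
library(doMC)
library(randomForest)
registerDoMC()

x <- matrix(runif(500), 100)
y <- gl(2, 50)

# Each worker returns an (n_obs x n_classes) matrix of raw vote counts;
# .combine = "+" sums them element-wise, which is far cheaper than
# merging whole forest objects.
votes <- foreach(ntree = rep(25000, 6), .combine = "+",
                 .packages = "randomForest") %dopar% {
  rf <- randomForest(x, y, ntree = ntree)
  predict(rf, x, type = "vote", norm.votes = FALSE)
}

pred <- factor(levels(y)[max.col(votes)], levels = levels(y))
```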

Upvotes: 32

Views: 30638

Answers (5)

Steve Weston

Reputation: 19667

Setting .multicombine to TRUE can make a significant difference:

rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
    randomForest(x, y, ntree=ntree)
}

This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.

Upvotes: 35

Ashok Krishna

Reputation: 143

The H2O package can be used to solve your problem.

According to the H2O documentation page, H2O is "the open source math engine for big data that computes parallel distributed machine learning algorithms such as generalized linear models, gradient boosting machines, random forests, and neural networks (deep learning) within various cluster environments."

Random Forest implementation using H2O:

https://www.analyticsvidhya.com/blog/2016/05/h2o-data-table-build-models-large-data-sets/
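A minimal sketch of what that looks like, assuming the h2o package and a working Java runtime are installed (column names and tree count are illustrative):

```r
# Sketch: distributed random forest with H2O (assumes h2o + Java installed).
library(h2o)
h2o.init(nthreads = -1)   # start a local H2O cluster on all available cores

dat <- data.frame(matrix(runif(500), 100))
dat$y <- gl(2, 50)
hf <- as.h2o(dat)

rf <- h2o.randomForest(y = "y",
                       x = setdiff(colnames(hf), "y"),
                       training_frame = hf,
                       ntrees = 500)
print(rf)
h2o.shutdown(prompt = FALSE)
```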

Upvotes: 5

Richard

Reputation: 61259

I wonder if the parallelRandomForest code would be helpful to you?

According to the author, it ran about 6 times faster on his data set, with 16 times lower memory consumption.

SPRINT also has a parallel implementation here.

Upvotes: 4

Soren Havelund Welling

Reputation: 1893

Depending on your CPU, you could probably get a 5%-30% speed-up by choosing the number of jobs to match the number of registered cores, which in turn should match the number of logical cores on your system (sometimes it is more efficient to match the number of physical cores instead). If you have a generic Intel dual-core laptop with Hyper-Threading (4 logical cores), then doMC has probably registered a cluster of 4 cores. Two cores will then idle while iterations 5 and 6 are computed, and you also pay the extra time to start and stop two extra jobs. It would be more efficient to run only 2-4 jobs, each growing more trees.
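That advice can be sketched like this (the job-count and tree-count choices are illustrative, not prescriptive):

```r
# Sketch: size the number of foreach jobs to the number of physical cores.
library(doMC)
library(randomForest)
library(parallel)

n_cores <- detectCores(logical = FALSE)  # physical cores; logical = TRUE for logical cores
registerDoMC(cores = n_cores)

x <- matrix(runif(500), 100)
y <- gl(2, 50)

# One job per core, each growing an equal share of the total trees,
# so no core sits idle waiting for leftover iterations.
total_trees <- 150000
rf <- foreach(ntree = rep(ceiling(total_trees / n_cores), n_cores),
              .combine = randomForest::combine, .multicombine = TRUE,
              .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}
```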

Upvotes: 1

Dirk is no longer here

Reputation: 368201

Are you aware that the caret package can do a lot of the hand-holding for parallel runs (as well as data prep, summaries, ...) for you?

Ultimately, of course, if there are costly operations left in the random forest computation itself, there is little you can do, as Andy spent quite a few years improving it. I would expect little to no low-hanging fruit to be left for the picking...
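For reference, a minimal caret sketch (resampling settings are illustrative; with a doMC backend registered, caret distributes the resampling work across cores on its own):

```r
# Sketch: let caret handle the parallel bookkeeping (assumes caret installed).
library(doMC)
library(caret)
registerDoMC()

x <- matrix(runif(500), 100)
y <- gl(2, 50)

# With a parallel backend registered, train() runs the
# cross-validation folds in parallel automatically.
fit <- train(x, y, method = "rf",
             trControl = trainControl(method = "cv", number = 5))
```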

Upvotes: 12
