Parallelization with Rborist

I have a large n (>1,000,000) dataset with a small number of features to estimate (regression) random forest and have been looking to implement Rborist (in R). I'd like to parallelize my work, but am not finding much guidance on how that would be done. I have 16 processors to use on the machine where it's running. When I use doParallel with the randomForest package, for example, the command:

rf <- foreach(ntree=rep(32, 16), .combine=combine, .packages='randomForest') %dopar% randomForest(x, y, nodesize = 25, ntree=ntree)

It launches 16 R processes, and works slowly as randomForest does, but works.

The analogous command for Rborist:

rb <- foreach(ntree=rep(32, 16), .combine=combine, .packages='Rborist') %dopar% Rborist(x, y, minNode = 25, ntree=ntree)

Throws the error:

error calling combine function:

Warning message: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, : all scheduled cores encountered errors in user code

Does anyone know how to parallelize with Rborist? It does not appear to be happening under the hood as it's only using 1 cpu when I run:

rb <- Rborist(x, y, minNode = 25, ntree = 512)

Upvotes: 2

Answers (2)

Mark Seligman

Reputation: 21

Rborist currently uses all available cores. Would it be useful to offer a way to tune this?

Have you tried the latest version on CRAN, 0.1-3? This contains a change to the default minimum node size for regression, improving accuracy in some cases.

We've been making some strides toward improving performance with modest (as opposed to small) predictor count. This should also be reflected by changes in the latest release.

Large running memory footprint is probably a consequence of the breadth-first splitting approach. One way to conserve memory is to carve the problem into chunks, but we haven't gotten there yet.

Large final memory size is chiefly due to caching leaf information for subsequent use by other packages or for quantile regression. Perhaps we should add a "noLeaf" option for users who are not interested in either of those options.

Upvotes: 0

phiver

Reputation: 23598

Rborist runs in parallel by itself. It uses all my threads on my machine (win 10 64bit). But then I didn't load doParallel / foreach first.

Same goes for the ranger package, but in ranger you can set the number of threads to use.

Speedy implementations of a rf are of the top of my head:

Rborist (large n, low p)
ranger (handles large p, modest n)
random forest.ddr (haven't tested)
distributed random forest in H2O. very fast, but makes use of stopping criteria.

Upvotes: 1

Parallelization with Rborist

Answers (2)

Related Questions