Reputation: 5925
I am interested in running a Random Forest model on a very large dataset. I have been reading about "parallel computing" in an effort to make the code run faster, and I came across this post (parallel execution of random forest in R) with the following suggestion:
library(randomForest)
library(doMC)
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
  randomForest(x, y, ntree=ntree)
}
I am trying to understand what is happening in the above code. My guess is that 6 Random Forest models (each with 25000 trees) are being fit to the dataset and then combined into a single model?
I started looking into the "combine()" function in R (https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) - it seems that the "combine()" function is combining several Random Forest models into a single model (here, I think 3 Random Forest models are being combined into a single model):
data(iris)
rf1 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf3 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf.all <- combine(rf1, rf2, rf3)
print(rf.all)
My Question: Can someone please confirm if I have understood this correctly? In the above code, are 6 Random Forest models being trained in parallel and then combined into a single model - is this correct?
Upvotes: 0
Views: 118
Reputation: 9910
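Yes, I would say so. foreach's .combine= argument takes a function that is applied to the results of the individual iterations to combine them. Here that function is randomForest::combine, so the six forests fitted by the loop body are merged into a single randomForest object.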
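If you want to check this yourself, here is a minimal sketch. It is not the code from the linked post: I have reduced ntree to 50 so it finishes quickly, and I use %do% (sequential) instead of %dopar% so that it runs without registering a parallel backend. Those two changes are my own assumptions for illustration; the combining step is the same either way.

```r
library(randomForest)
library(foreach)

set.seed(1)
x <- matrix(runif(500), 100)  # 100 rows, 5 predictors
y <- gl(2, 50)                # two-level factor response

# Six independent forests of 50 trees each (small on purpose, for speed).
# %do% runs the iterations sequentially; with a registered backend,
# %dopar% would run them in parallel and combine them the same way.
rf <- foreach(ntree = rep(50, 6), .combine = randomForest::combine,
              .multicombine = TRUE, .packages = "randomForest") %do% {
  randomForest(x, y, ntree = ntree)
}

class(rf)  # the result is one randomForest object, not a list of six
rf$ntree   # total number of trees across the six pieces
```

One thing to be aware of: according to the randomForest documentation for combine(), the confusion and err.rate components of the combined object are NULL, so you will not get an out-of-bag error estimate from the merged forest.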
Upvotes: 1