Reputation: 21
I am working with the randomForest package in R. To speed up the classification step, I wanted to grow the forest in parallel. For that, I used the 'foreach' package in a similar way to what is shown in the 'foreach' vignette: split the total number of trees across the number of cores you want to use, grow the sub-forests in parallel, and then merge them with the 'combine' function from the 'randomForest' package:
require(randomForest)
require(foreach)
require(doParallel)

registerDoParallel(cores = CPUS)

# grow CPUS sub-forests of ceiling(NTREE/CPUS) trees each and merge
# them into a single forest with randomForest::combine
rf <- foreach(ntree = rep(ceiling(NTREE / CPUS), CPUS),
              .combine = randomForest::combine,
              .packages = 'randomForest') %dopar% {
    randomForest(x = t(Y), y = A, ntree = ntree, importance = TRUE, ...)
}
I compared the results of the "parallel" forest with a forest grown on a single core. The prediction performance on the test set seems similar, but the 'importance' values are considerably reduced, and this affects the subsequent variable-selection steps.
imp <- importance(rf, type = 1)
I would like to know why this happens, and whether it is expected behaviour or I have made a mistake. Thanks a lot!
Upvotes: 2
Views: 810
Reputation: 1893
randomForest::combine does not support recomputation of variable importance. In the randomForest package, importance is calculated only once, just before the randomForest::randomForest function returns. Two options are:
Write your own variable importance function, which takes the combined forest and the training set as inputs. That is roughly ~50 lines of code; a minimal sketch is given after this list.
Use an 'lapply'-like parallel computation where each randomForest object is an element of the output list, combine the forests with do.call(combine, rf.list) outside the foreach loop, then aggregate the variable importance across all forests and simply take the mean. This is an approximation of the full forest's variable importance, but quite a good one.
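For option 1, a minimal sketch (the helper permImportance and its defaults are my own, not part of randomForest): it permutes each predictor in the supplied data and measures the drop in accuracy. This approximates the mean decrease in accuracy, but unlike randomForest it is not computed on per-tree out-of-bag samples:
permImportance = function(forest, x, y, nperm = 5) {
  baseline = mean(predict(forest, x) == y)        # accuracy before permuting
  sapply(names(x), function(v) {
    mean(replicate(nperm, {
      xp = x
      xp[[v]] = sample(xp[[v]])                   # permute one predictor
      baseline - mean(predict(forest, xp) == y)   # accuracy drop = importance
    }))
  })
}
# e.g. permImportance(big.rf, iris[, -5], iris$Species)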
For option 2, a code example that also works on Windows:
library(randomForest)
library(doParallel)
CPUS=6; NTREE=5000
cl = makeCluster(CPUS)
registerDoParallel(cl)
data(iris)
rf.list = foreach(ntree = rep(ceiling(NTREE / CPUS), CPUS),
                  .combine = c,
                  .packages = "randomForest") %dopar% {
  # wrap each forest in a list so .combine = c returns a list of forests
  list(randomForest(Species ~ ., data = iris, importance = TRUE, ntree = ntree))
}
stopCluster(cl)
big.rf = do.call(combine, rf.list)  # merge the sub-forests into one
# combine() does not recompute importance, so average it manually
big.rf$importance = rf.list[[1]]$importance
for (i in 2:CPUS) big.rf$importance = big.rf$importance + rf.list[[i]]$importance
big.rf$importance = big.rf$importance / CPUS
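# the same averaging, more compactly (a sketch using base R's Reduce):
# big.rf$importance = Reduce(`+`, lapply(rf.list, function(f) f$importance)) / CPUS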
varImpPlot(big.rf)
# check the number of trees in the combined forest and in one sub-forest
print(big.rf)       # ~5000 trees in total (6 x 834 = 5004)
rf.list[[1]]$ntree  # 834 trees per sub-forest
# train a single forest of the same size for comparison
rf.single = randomForest(Species ~ ., data = iris, ntree = 5000, importance = TRUE)
varImpPlot(big.rf)
varImpPlot(rf.single)
# print unscaled variable importance: no large deviations between the two
print(big.rf$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.033184860 0.023506673 0.04043017 0.03241500 9.679552
# Sepal.Width 0.008247786 0.002135783 0.00817186 0.00613059 2.358298
# Petal.Length 0.335508637 0.304525644 0.29786704 0.30933142 43.160074
# Petal.Width 0.330610910 0.307016328 0.27129746 0.30023245 44.043737
print(rf.single$importance)
# setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.031771614 0.0236603417 0.03782824 0.031049531 9.516198
# Sepal.Width 0.008436457 0.0009236979 0.00880401 0.006048261 2.327478
# Petal.Length 0.341879367 0.3090482654 0.29766905 0.312507316 43.786481
# Petal.Width 0.322015885 0.3045458852 0.26885097 0.296227150 43.623370
# but when plotting with varImpPlot, scale=TRUE by default
# either simply turn off scaling to get comparable results
varImpPlot(big.rf,scale=F)
varImpPlot(rf.single,scale=F)
#... or correct the scaling: the combined forest has CPUS times more trees,
# so the standard error of the importance shrinks by a factor of sqrt(CPUS)
big.rf$importanceSD = CPUS^-0.5 * big.rf$importanceSD
# and now there are no large differences in the scaled variable importance either
varImpPlot(big.rf,scale=T)
varImpPlot(rf.single,scale=T)
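For completeness, the combined forest predicts like any ordinary randomForest object (shown on the training data here, since iris has no separate test set):
preds = predict(big.rf, iris)
mean(preds == iris$Species)  # training-set accuracy of the combined forest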
Upvotes: 1