Reputation: 1266
I'm trying to run the train() function from the caret package, but the time it takes to run makes it prohibitive. I've tried improving the speed by running on multiple cores, but even so it's still very slow. Are there any other ways to speed up machine learning processes like this?
library(parallel)
library(doParallel)
library(caret)
library(mlbench)

data(Sonar)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]

# Set up a cluster on all but one core and register it for caret
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
system.time(fit <- train(Class ~ ., data = training, method = "rf", trControl = trControl))
stopCluster(cluster)
Upvotes: 2
Views: 1824
Reputation: 2289
I used caret many times, but it always seemed very slow; now I use the h2o package, which is very fast. I recommend reading this article to see why I made that decision. Using the Sonar dataset, I generated this code.
# Start H2O using localhost IP, port 54321, all CPUs, and 6 GB of memory
data(Sonar, package = "mlbench")
library(h2o)
h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "6g")

Sonar.split <- h2o.splitFrame(data = as.h2o(Sonar), ratios = 0.75)
Sonar.train <- Sonar.split[[1]]
Sonar.test <- Sonar.split[[2]]

# hyper_params <- list(mtries = c(2, 5, 10), ntrees = c(100, 250, 500), max_depth = c(5, 7, 9))
hyper_params <- list(mtries = c(2, 5, 10))

system.time(grid <- h2o.grid(x = 1:60, y = 61,
                             training_frame = Sonar.train,
                             validation_frame = Sonar.test,
                             algorithm = "drf",
                             grid_id = "covtype_grid",
                             hyper_params = hyper_params,
                             search_criteria = list(strategy = "Cartesian"),
                             seed = 1234))
# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
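If you want to use the winner, here is a minimal sketch of pulling the best model out of the sorted grid and scoring it on the test frame, using the standard h2o.getModel and h2o.performance functions:
# Retrieve the top model from the sorted grid (lowest logloss first)
best_model <- h2o.getModel(sortedGrid@model_ids[[1]])
# Evaluate it on the held-out test frame
h2o.performance(best_model, newdata = Sonar.test)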
For both the caret run and the h2o run I timed the algorithm with the system.time() function. The results were as follows: caret took 20.81 seconds, h2o took 1.94 seconds. The difference in execution time becomes even more evident with larger data.
Upvotes: 1
Reputation: 3827
There are a number of steps you can take:
Reduce the number of features in your data using Principal Component Analysis (PCA) or Independent Component Analysis (ICA). You can use caret::preProcess to do this. You can also remove unimportant features by running the random forest once and inspecting the feature importances, as sketched below.
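For example, here is a minimal sketch of both ideas using caret::preProcess and varImp; it reuses the training/testing split and the fit object from the question, and the 0.95 variance threshold is just an illustrative choice:
# Fit PCA on the predictors only (column 61 of Sonar is the Class label)
pp <- preProcess(training[, -61], method = "pca", thresh = 0.95)
# Apply the same rotation to both splits
training_pca <- predict(pp, training[, -61])
testing_pca  <- predict(pp, testing[, -61])
# Inspect feature importances from an already-trained random forest
varImp(fit)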
Try using the ranger library's implementation of random forest. Set method = 'ranger' in the train() call. I have found ranger is often quicker.
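As a minimal sketch, assuming the training data and trControl object from the question (ranger must be installed):
# Same caret workflow, swapping in the faster C++ ranger backend
fit_ranger <- train(Class ~ ., data = training, method = "ranger", trControl = trControl)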
Upvotes: 1