Reputation: 1266

R - Improve performance of caret::train function

I'm trying to run the train() function from the caret package, but it takes prohibitively long. I've tried speeding it up by running on multiple cores, but it is still slow. Are there other ways to speed up machine learning processes like this?

library(parallel)
library(doParallel)
library(caret)
library(mlbench)

data(Sonar)

inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

trControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# Fit a random forest on the training partition and time the call
system.time(fit <- train(Class ~ ., data = training, method = "rf", trControl = trControl))

stopCluster(cluster)

Upvotes: 2

Answers (2)

Rafael Díaz

Reputation: 2289

I used caret a lot, but it often seemed very slow, so I now use the h2o package, which is very fast. I recommend reading this article to see why I made that decision. Using the Sonar data set, the code looks like this:

data(Sonar, package = "mlbench")
library(h2o)
# Start H2O on localhost, port 54321, with all CPUs and 6 GB of memory
h2o.init(ip = "localhost", port = 54321, nthreads = -1, max_mem_size = "6g")
# Split the data 75/25 into training and test frames
Sonar.split = h2o.splitFrame(data = as.h2o(Sonar), ratios = 0.75)
Sonar.train = Sonar.split[[1]]
Sonar.test  = Sonar.split[[2]]

#hyper_params <- list(mtries = c(2,5,10), ntrees = c(100,250,500), max_depth = c(5,7,9))
hyper_params <- list(mtries = c(2,5,10))
system.time(grid <- h2o.grid(x = 1:60, y = 61, training_frame = Sonar.train, validation_frame = Sonar.test,
                 algorithm = "drf", grid_id = "covtype_grid", hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234))

# Sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid

For both caret and h2o I timed the run with the system.time function. The results were 20.81 seconds with caret and 1.94 seconds with h2o. The difference in execution time becomes more evident with larger data.

Upvotes: 1

willk

Reputation: 3827

There are a number of steps you can take:

Reduce the number of features in your data using Principal Component Analysis (PCA) or Independent Component Analysis (ICA). You can use caret::preProcess to do this. You can also run the random forest once, inspect the feature importances, and remove the unimportant features.
Try the ranger implementation of random forest by setting method = 'ranger' in the train() call. I have found ranger is often quicker.
Decrease the number of cross-validation folds. This reduces the number of data splits and therefore the number of models that have to be trained.
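
A minimal sketch of how the three suggestions above could be combined on the Sonar data, assuming the ranger package is installed alongside caret. The seed, the choice of 3 folds, and the small mtry grid are illustrative assumptions, not tuned recommendations; the grid is set explicitly so that mtry stays below the number of principal components that remain after PCA.

```r
library(caret)
library(mlbench)

data(Sonar)
set.seed(1234)  # assumed seed, only for reproducibility of the sketch

inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing  <- Sonar[-inTraining, ]

# Fewer folds: 3 instead of 5 means fewer model fits per candidate
trControl <- trainControl(method = "cv", number = 3)

# Small explicit grid (illustrative values) for caret's "ranger" method
tuneGrid <- expand.grid(mtry = c(2, 5),
                        splitrule = "gini",
                        min.node.size = 1)

# preProcess = "pca" centers, scales, and projects the predictors onto
# principal components inside each resample before ranger is fitted
timing <- system.time(
  fit <- train(Class ~ ., data = training,
               method = "ranger",
               preProcess = "pca",
               tuneGrid = tuneGrid,
               trControl = trControl)
)

pred <- predict(fit, newdata = testing)
print(timing)
print(confusionMatrix(pred, testing$Class))
```

Whether this is actually faster than method = "rf" on a given data set is something to check with system.time rather than assume, and PCA may cost some accuracy in exchange for the speed.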

Upvotes: 1
