I. Ara

Reputation: 47

caret train rf model: how long does it take to run on big data?

My data has 500,000 observations and 7 variables. I split the data into 80% training data and 20% test data, and used caret to train the model. The code is below. I started it, but it was taking so long that I eventually had to stop it. Is there anything wrong with my model, or does it usually take a long time on big data? Any suggestions?

library(caret)
set.seed(130000000)

classifier_rf <- train(y=train$active,
                       x=train[3:5],
                       data=train,
                       method='rf',
                       trControl=trainControl(method='repeatedcv',
                                              number=10,
                                              repeats=10))

Upvotes: 4

Views: 7117

Answers (3)

timxymo1225

Reputation: 661

As I understand it, caret still calls the randomForest function underneath and adds the cross-validation and grid search on top, so it will take a while.

For random forest models specifically, I usually just use the ranger package, which is much faster. You can find its manual here.
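
As a rough illustration of the ranger route, here is a minimal sketch on synthetic data. The data frame `toy` and its columns are made up for the example and only stand in for the real training set; the tree count and thread count are arbitrary choices.

```r
library(ranger)

set.seed(1)
# Hypothetical stand-in for the real data: 3 numeric predictors, binary factor outcome
n <- 5000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$active <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "yes", "no"))

# Fit a single forest; num.threads sets how many cores ranger may use
fit <- ranger(active ~ ., data = toy, num.trees = 200, num.threads = 2, seed = 1)

fit$prediction.error                            # out-of-bag prediction error
preds <- predict(fit, data = toy)$predictions   # predicted classes
head(preds)
```

If you would rather keep the train() workflow, caret also accepts method = 'ranger', which should swap the backend without changing the rest of the call.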

Upvotes: 1

Sibs

Reputation: 91

Your best bet is probably to try parallelizing the process. For a useful resource click here.
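
A minimal sketch of what parallelizing can look like with caret, assuming the doParallel package is installed. The toy data, the 2 workers, and the reduced 5-fold setup are placeholders chosen so the example runs quickly, not recommendations.

```r
library(caret)
library(doParallel)

# Start a small cluster; 2 workers is an arbitrary choice for the example
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

set.seed(1)
# Hypothetical toy data standing in for the real training set
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$active <- factor(ifelse(toy$x1 + rnorm(n) > 0, "yes", "no"))

# allowParallel = TRUE is already the default; it is spelled out for clarity
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit <- train(x = toy[, c("x1", "x2", "x3")], y = toy$active,
             method = "rf", trControl = ctrl, ntree = 100)

stopCluster(cl)
registerDoSEQ()  # switch back to sequential execution
```

Each worker holds its own copy of the data, so with 400,000 training rows memory can become the limit before the number of cores does.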

Upvotes: 2

zacdav

Reputation: 4671

500,000 samples might be a lot for your machine, depending on how powerful it is. More importantly, you have specified repeated k-fold cross-validation, which is a time-consuming process.

In a single round of k-fold cross-validation, a model is trained K times, each time on K-1 of the folds, and tested on the one fold that was held out. Your K is 10, and you are repeating the whole procedure 10 times, so that is 100 models for each candidate value of the tuning parameter.
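
The arithmetic can be written out as a quick back-of-the-envelope check. The size of the tuning grid (`n_mtry`) is an assumed value for illustration; the actual number of candidate mtry values caret tries depends on the data and on tuneLength.

```r
folds   <- 10      # number = 10 in trainControl
repeats <- 10      # repeats = 10 in trainControl
n_mtry  <- 3       # assumed number of candidate mtry values (hypothetical)
n_train <- 400000  # 80% of 500,000 observations

resample_fits <- folds * repeats               # fits per candidate mtry value
total_fits    <- resample_fits * n_mtry + 1    # plus one final fit on all training rows
rows_per_fit  <- n_train * (folds - 1) / folds # rows used to train each resample fit

resample_fits  # 100
total_fits     # 301
rows_per_fit   # 360000
```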

These 100 models all have to be trained and then tested. I would first run your problem on a single training/testing split before moving on to cross-validation; that will also help you estimate the expected run time.
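
One way to time a single fit is sketched below, assuming caret and randomForest are installed. The toy data, the mtry value, and the ntree value are placeholders; in practice you would use a random subsample of the real training set and increase its size step by step.

```r
library(caret)

set.seed(1)
# Hypothetical toy data; replace with a random subsample of the real training set
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$active <- factor(ifelse(toy$x1 + rnorm(n) > 0, "yes", "no"))
predictors <- c("x1", "x2", "x3")

# Single 80/20 split
idx <- createDataPartition(toy$active, p = 0.8, list = FALSE)
trn <- toy[idx, ]
tst <- toy[-idx, ]

# method = "none" fits exactly one model, so the tuning grid must have one row
timing <- system.time(
  fit <- train(x = trn[, predictors], y = trn$active,
               method = "rf",
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(mtry = 2),
               ntree = 100)
)

timing["elapsed"]                                      # seconds for one fit
acc <- mean(predict(fit, tst[, predictors]) == tst$active)
acc                                                    # hold-out accuracy
```

Run time does not necessarily grow linearly with the number of rows, so treat any extrapolation from a subsample as a rough lower bound.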


As an aside, set.seed() doesn't require such a large number; any small number is usually sufficient.

You've also specified the x, y, and data arguments. I believe you only need to specify data when using the formula interface for training.
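
For comparison, here is a small sketch of the two ways to call train(). The toy data and the single-row tuning grid are assumptions made to keep the example fast.

```r
library(caret)

set.seed(1)
# Hypothetical toy data standing in for the real training set
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$active <- factor(ifelse(toy$x1 + rnorm(n) > 0, "yes", "no"))

ctrl <- trainControl(method = "none")  # one fit, no resampling
grid <- data.frame(mtry = 2)

# x/y interface: predictors and outcome are passed directly, no data argument
fit_xy <- train(x = toy[, c("x1", "x2", "x3")], y = toy$active,
                method = "rf", trControl = ctrl, tuneGrid = grid, ntree = 50)

# Formula interface: columns are looked up in the data argument
fit_formula <- train(active ~ x1 + x2 + x3, data = toy,
                     method = "rf", trControl = ctrl, tuneGrid = grid, ntree = 50)
```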

Upvotes: 0
