Reputation: 47
My data has 500,000 observations and 7 variables. I split the data 80/20 into training and test sets and used caret to train the model; the code is below. Training was taking so much time that I eventually had to stop it. Is there anything wrong with my model, or does it usually take this long on big data? Any suggestions?
```r
library(caret)
set.seed(130000000)
classifier_rf <- train(y = train$active,
                       x = train[3:5],
                       data = train,
                       method = 'rf',
                       trControl = trainControl(method = 'repeatedcv',
                                                number = 10,
                                                repeats = 10))
```
Upvotes: 4
Views: 7117
Reputation: 661
From my understanding, caret still calls the randomForest package underneath, plus the cross-validation/grid-search on top, so it will take a while. For random forest models specifically, I usually use the ranger package instead, and it's much faster. You can find its manual here.
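A minimal sketch of the ranger alternative, assuming the same `train` data frame from the question (predictors in columns 3:5, outcome in `train$active`); `num.trees = 500` is ranger's default and is shown only for clarity:

```r
library(ranger)

# Build a frame with just the three predictors and the outcome
dat <- cbind(train[3:5], active = train$active)

# ranger fits a random forest and reports out-of-bag error for free,
# so you may not need repeated cross-validation at all for a first estimate
fit <- ranger(active ~ ., data = dat, num.trees = 500)
fit$prediction.error   # OOB error estimate
```

Note that caret also exposes ranger via `method = 'ranger'` if you want to keep the caret workflow.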
Upvotes: 1
Reputation: 91
Your best bet is probably to try parallelizing the process. For a useful resource click here.
Upvotes: 2
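A sketch of one common way to parallelize caret's resampling, using the foreach backend from doParallel (the column indices and `active` outcome are taken from the question's code; core count is a placeholder you should adjust for your machine):

```r
library(caret)
library(doParallel)

# Register a parallel backend; leave one core free for the OS
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# With a backend registered, caret runs resampling iterations in parallel
classifier_rf <- train(y = train$active,
                       x = train[3:5],
                       method = 'rf',
                       trControl = trainControl(method = 'repeatedcv',
                                                number = 10,
                                                repeats = 10,
                                                allowParallel = TRUE))

stopCluster(cl)
```

This parallelizes across resamples, so the speedup is bounded by the number of cores and the 100 fits still have to happen somewhere; it reduces wall-clock time, not total work.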
Reputation: 4671
500,000 samples might be a lot for your machine, depending on how powerful it is. More importantly, you have specified repeated cross-validation, which is a time-consuming process.
In a single round of K-fold cross-validation, a model is trained K times, each time on K-1 folds and tested on the remaining held-out fold. Your K is 10 in the provided code, and you are repeating the procedure 10 times, so that is 100 model fits.
These 100 models all have to be trained and tested. I would time your problem on a single training/testing split before moving on to cross-validation; that will also help you estimate the expected run time.
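The single-fit timing suggested above can be sketched like this (same data and columns as in the question; `tuneGrid` with one `mtry` row is needed because `method = 'none'` disables tuning, and `mtry = 2` here is an illustrative choice):

```r
library(caret)

# Time one plain fit with no resampling to estimate the cost of 100 fits
single_fit_time <- system.time(
  train(y = train$active,
        x = train[3:5],
        method = 'rf',
        trControl = trainControl(method = 'none'),
        tuneGrid = data.frame(mtry = 2))
)
print(single_fit_time)
# Rough total for 10x10 repeated CV is about 100x the elapsed time above
```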
As an aside, set.seed() doesn't require such a large number; any small number is usually sufficient.
You've also specified the x, y, and data arguments; I believe you only need data when using a formula definition for the training.
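The two calling conventions side by side, as a sketch (same columns as the question; the `cbind` is just one way to assemble a frame containing both the predictors and the outcome for the formula form):

```r
library(caret)

ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)

# x/y interface: pass predictors and outcome directly, no data argument
m1 <- train(x = train[3:5], y = train$active,
            method = 'rf', trControl = ctrl)

# formula interface: data is required, x/y are not used
m2 <- train(active ~ ., data = cbind(train[3:5], active = train$active),
            method = 'rf', trControl = ctrl)
```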
Upvotes: 0