Reputation: 799
I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train models with doParallel and four cores. Even training on 10% of my data takes well over eight hours for the methods I've tried (rf, nnet, adabag, svmPoly). I'm resampling with bootstrapping 3 times and my tuneLength is 5. Is there anything I can do to speed up this agonizingly slow process? Someone suggested that using the underlying library directly can speed up the process by as much as 10x, but before I go down that route I'd like to make sure there is no other alternative.
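Roughly what my setup looks like, as a simplified sketch (the data frame and outcome names here are placeholders, not my actual code):

    library(caret)
    library(doParallel)

    cl <- makeCluster(4)               # four cores
    registerDoParallel(cl)

    ctrl <- trainControl(method = "boot", number = 3)  # bootstrap, 3 resamples

    fit <- train(outcome ~ ., data = train_df,  # placeholder names
                 method = "rf",
                 trControl = ctrl,
                 tuneLength = 5)

    stopCluster(cl)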
Upvotes: 13
Views: 13410
Reputation: 14316
@phiver hits the nail on the head but, for this situation, there are a few things to suggest:
Max
Upvotes: 18
Reputation: 903
Great inputs by @phiver and @topepo. I will try to summarize and add some more points that I gathered from the bit of searching through SO posts that I did for a similar problem:
Upvotes: 2
Reputation: 23598
What people forget when comparing the underlying model with caret is that caret has a lot of extra stuff going on.
Take your random forest, for example: bootstrap resampling with number 3, and tuneLength 5. So you resample 3 times, and because of the tuneLength caret tries 5 candidate values to find a good value for mtry. In total you run 15 random forests and compare them to pick the settings for the final model, versus only 1 if you use the basic random forest model directly.
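For contrast, a single fit with the underlying package looks like this (formula and data are placeholders): one forest, default mtry, no resampling, no grid search.

    library(randomForest)

    # One forest, no resampling or tuning grid:
    rf_fit <- randomForest(outcome ~ ., data = train_df)  # placeholder names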
Also, you are running in parallel on 4 cores, and random forest needs all the observations available, so each worker gets its own copy of the training data: your training observations are in memory 4 times over. That probably leaves not much memory for training the model.
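If memory is the bottleneck, registering fewer workers is one lever (a sketch, assuming the doParallel backend):

    library(doParallel)

    cl <- makeCluster(2)   # fewer workers means fewer in-memory copies of the data
    registerDoParallel(cl)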
My advice is to start scaling down to see if you can speed things up: set the bootstrap number to 1 and the tune length back to the default of 3, or even set the trainControl method to "none", just to get an idea of how fast the model is with minimal settings and no resampling. A sketch of that minimal setup follows.
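Something like this (the mtry value and data names are placeholders; with method = "none", train() requires a single-row tuneGrid and fits exactly one model):

    library(caret)

    ctrl_min <- trainControl(method = "none")   # no resampling at all

    fit_min <- train(outcome ~ ., data = train_df,        # placeholder names
                     method = "rf",
                     trControl = ctrl_min,
                     tuneGrid = data.frame(mtry = 4))     # one fixed mtry candidate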
Upvotes: 16