mjoudy

Reputation: 149

Time needed to train a predictive model in R

I have a dataset with about 20k rows and 160 columns. After some simple preprocessing, such as removing near-zero-variance variables and variables with a high proportion of NAs, I kept only 56 columns as features. Now I want to train a model on this data with the random forest method, but after about an hour it still hadn't finished, so I aborted it.

Is there any code I can use to predict the time needed to train the model based on my PC's configuration? Typically, how long does it take to train a random forest or rpart model on a dataset of these dimensions?

Upvotes: 2

Views: 328

Answers (2)

agenis

Reputation: 8377

You can use the GuessCompx package to estimate the empirical complexity and computation time of your randomForest call. Let's create fake data of the same size as yours:

# simulate a data frame the same size as the real one: 20,000 rows x 56 integer columns
df = data.frame(matrix(rpois(20000*56, 3), ncol=56))

Then load the libraries:

library(GuessCompx)
library(randomForest)

Run the test; you get an N*log(N) time complexity:

CompEst(df, randomForest)
#### $`TIME COMPLEXITY RESULTS`$best.model
#### [1] "NLOGN"
#### $`TIME COMPLEXITY RESULTS`$computation.time.on.full.dataset
#### [1] "3M 30.31S"
#### $`MEMORY COMPLEXITY RESULTS`
#### $`MEMORY COMPLEXITY RESULTS`$best.model
#### [1] "QUADRATIC"
#### $`MEMORY COMPLEXITY RESULTS`$memory.usage.on.full.dataset
#### [1] "14033 Mb"
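
If your real task is supervised rather than unsupervised, you can wrap the fit in a one-argument function so CompEst can apply it to subsets of the data. A minimal sketch, assuming the outcome is column X1 of df (the formula is an illustration, not part of the original answer):

# hypothetical supervised wrapper: assumes X1 is the outcome column
CompEst(df, function(d) randomForest(X1 ~ ., data = d))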

It seems that time is not the problem so much as memory: the estimated 14 GB reaches the system's limit, which gets in the way and can slow the algorithm down a lot (the 3.5 minutes predicted for the full dataset were exceeded in practice because of the memory pressure; it took 12 minutes for me). Try to increase memory.limit() as much as you can.
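
For example, on Windows you could check and raise the limit before training (a minimal sketch; memory.limit() is Windows-only, and the 16000 MB value is an assumption to adapt to your machine):

memory.limit()               # current limit in MB (Windows-only)
memory.limit(size = 16000)   # raise it to ~16 GB if the hardware allows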

Upvotes: 1

quickreaction

Reputation: 685

Try setting some parameters for the randomForest function. Start with a small number of trees (ntree), a small number of variables drawn at each split (mtry), and/or a small maximum number of terminal nodes, i.e. "leaves" (maxnodes). Then change the parameters to increase your model's complexity and accuracy. Starting small and slowly increasing the parameters also keeps the computation fast while you observe their effect on performance; see the sketch below.
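
A minimal sketch of that progression (the outcome column y and the parameter values are assumptions to adapt to your data):

library(randomForest)

# start small: few trees, few candidate variables per split, shallow trees
fit_small <- randomForest(y ~ ., data = df,
                          ntree = 50, mtry = 4, maxnodes = 16)

# then scale up gradually, timing each run to see the cost of added complexity
system.time(
  fit_big <- randomForest(y ~ ., data = df, ntree = 500, mtry = 8)
)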

Note: if you're using randomForest for feature selection (which is why I use it), use a large ntree, a low mtry, and a low maxnodes so you can extract good information about the individual variables, as in the sketch below.
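
For instance, with importance = TRUE you can pull out the per-variable importance measures afterwards (a minimal sketch; y and the parameter values are again assumptions):

# grow many small trees and record importance for feature selection
fit <- randomForest(y ~ ., data = df, importance = TRUE,
                    ntree = 1000, mtry = 2, maxnodes = 8)
importance(fit)    # per-variable importance measures
varImpPlot(fit)    # quick visual ranking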

Upvotes: 2
