Reputation: 1448
I am new to parallel computing and ML in R, so I found it worrying when a programme could not complete after running for over 15 minutes. I have no idea how long a machine learning programme run in parallel should take, or how to estimate the time it needs.
The following ML and parallel-computing code I tried could not finish after 20 minutes. Could anyone suggest a way to figure out how long I should expect to wait when running 200,000 rows of data with 14 columns through the code below? Or is there a problem with my code?
library(caret)   # provides train() and trainControl()
library(doMC)
registerDoMC(cores = 2)

set.seed(7)
# trainControl object is defined earlier (not shown)
fit.svmRadial <- train(gap ~ ., data = trainingDataML, method = "svmRadial",
                       metric = "RMSE", trControl = trainControl)
# summarize fit
print(fit.svmRadial)

library(parallel)
detectCores()  # output: 4 cores
Upvotes: 0
Views: 204
Reputation: 1116
I have no experience parallelizing computation on MacBooks, but I might be able to offer some advice, as I commonly run ML algorithms that take hours or days to complete.
For 200,000 rows of data, 15 minutes is not long at all! Leave it to execute overnight or in the background while you do other work.
As Ben suggested, I would subset the training data from 200,000 rows down to, say, 2,000 and track the time it takes to compute. I personally use this code to measure the compute time:
sys.time <- proc.time()
# ... code to be timed ...
print(proc.time() - sys.time); remove(sys.time)
Do this for several sizes of training data (at least three, because the scaling won't be linear!) and you can extrapolate to the full 200,000 rows. There is no hard-and-fast rule for choosing subset sizes, so be cautious and start small: there is no point waiting an hour on half the data just to get a slightly better prediction!
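For concreteness, here is a minimal sketch of that workflow, assuming the trainingDataML data frame and trainControl object from your question are already in scope; the subset sizes are only placeholders, so adjust them to whatever finishes quickly on your machine:

## Sketch: time train() on a few subset sizes and see how the cost grows
## before committing to all 200,000 rows.
library(caret)
library(doMC)
registerDoMC(cores = 2)

sizes <- c(1000, 2000, 4000)   # placeholder sizes; start small
times <- numeric(length(sizes))

for (i in seq_along(sizes)) {
  set.seed(7)
  # random subset of the full training data (assumed to exist as in the question)
  subsetData <- trainingDataML[sample(nrow(trainingDataML), sizes[i]), ]
  sys.time <- proc.time()
  fit <- train(gap ~ ., data = subsetData, method = "svmRadial",
               metric = "RMSE", trControl = trainControl)
  times[i] <- (proc.time() - sys.time)["elapsed"]
}

## Inspect how elapsed time grows with the number of rows; extrapolating the
## trend (e.g. on a log-log scale) gives a rough idea of the cost at 200,000 rows.
print(data.frame(n = sizes, seconds = times))

Keep in mind that kernel SVMs tend to scale worse than linearly in the number of rows, so treat the extrapolation as a rough lower bound rather than a promise.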
Upvotes: 1