How can you improve computation time when predicting KNN Imputation?

Question

The run time seems extremely slow for my data set. This is the code:

    library(caret)
    library(data.table)
    knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
    knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])

The preProcess call that creates knnImputeValues runs fairly quickly; however, the predict call takes a tremendous amount of time. When I timed it on a subset of the data, this was the result:

    testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[ 1:15000, imputeColumns]))
    testtime

    user     969.78
    system   38.70
    elapsed  1010.72

Note also that caret's preProcess uses the "RANN" package for knnImpute.
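For comparison, here is a minimal sketch of calling RANN's `nn2` directly, so that every row with the same missing column is handled by one batched neighbour query. This is not caret's implementation: the toy matrices `train` and `query`, the column names, and `k = 5` are all made up for illustration, and no centering or scaling is done (caret's knnImpute centers and scales first).

```r
library(RANN)

# Hypothetical toy data: 'train' is complete, 'query' is missing column "a" in some rows.
set.seed(1)
train <- matrix(rnorm(3000), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
query <- matrix(rnorm(300),  ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
query[1:20, "a"] <- NA

k        <- 5
missRows <- which(is.na(query[, "a"]))  # rows that need imputing
obsCols  <- c("b", "c")                 # columns observed in those rows

# One batched query: k nearest training rows for every incomplete row,
# measured on the observed columns only.
nn <- nn2(train[, obsCols], query[missRows, obsCols, drop = FALSE], k = k)

# nn$nn.idx is a length(missRows) x k matrix of row indices into 'train'.
# Look up the neighbours' values of "a" and average them per row.
neighbourVals <- matrix(train[as.vector(nn$nn.idx), "a"], nrow = length(missRows))
query[missRows, "a"] <- rowMeans(neighbourVals)
```

With real data, rows would first have to be grouped by which columns are missing, with one `nn2` call per group.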

Now my full dataset is:

    str(mainData[ , imputeColumns])
    'data.frame':   1809032 obs. of  16 variables:
     $ V1: int  3 5 5 4 4 4 3 4 3 3 ...
     $ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
     $ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
     $ V4: int  2 5 5 12 4 5 11 8 7 8 ...
     $ V5: int  2 0 0 2 0 0 1 3 2 8 ...
     $ V6: int  648 489 489 472 472 472 497 642 696 696 ...
     $ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
     $ V8: int  0 0 0 0 0 0 0 1 1 1 ...
     $ V9: num  0 0 0 0 0 ...
     $ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
     $ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
     $ V12: num  2 5 5 12 4 5 11 8 7 8 ...
     $ V13: num  2 0 0 2 0 0 1 3 2 8 ...
     $ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
     $ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
     $ V16: num  657 756 756 756 756 ...

So is there something I'm doing wrong, or is this typical for how long it will take to run? If you do a back-of-the-envelope extrapolation (which I know isn't entirely accurate), you'd get what, 33 days?

Also, it looks like system time is very low and user time is very high. Is that normal?

My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz processor.

Additionally, would this improve the runtime of the predict function?

    library(doParallel)
    cl <- makeCluster(4)
    registerDoParallel(cl)

I tried it, and it didn't seem to make a difference other than all the processors looked more active in my task manager.
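Registering a backend only helps code that actually dispatches work to it, and the timing above suggests this predict call does not. One thing that could be tried instead is splitting the rows into chunks and calling predict on each chunk in a separate worker. The following is a minimal sketch under assumptions: `toy`, `pp`, `nWorkers`, and `rowsToImpute` are hypothetical stand-ins (numeric columns only), and the RANN package must be installed for knnImpute to run.

```r
library(caret)
library(parallel)

# Hypothetical toy data standing in for mainData[, imputeColumns].
set.seed(1)
toy <- data.frame(a = rnorm(2000), b = rnorm(2000), c = rnorm(2000))
toy$a[sample(2000, 100)] <- NA

# Fit the imputation model once on the training rows.
pp <- preProcess(toy[1:1000, ], method = "knnImpute")

# Split the rows to impute into contiguous chunks, one per worker.
nWorkers     <- 2
rowsToImpute <- 1001:2000
chunks <- split(rowsToImpute,
                cut(seq_along(rowsToImpute), nWorkers, labels = FALSE))

cl <- makeCluster(nWorkers)
clusterEvalQ(cl, library(caret))  # each worker needs caret loaded

# Each worker imputes its own chunk; the model and data are passed explicitly.
imputedChunks <- parLapply(cl, chunks, function(idx, model, data) {
  predict(model, data[idx, , drop = FALSE])
}, model = pp, data = toy)
stopCluster(cl)

# Reassemble in the original row order (unname avoids prefixed row names).
imputed <- do.call(rbind, unname(imputedChunks))
```

Each worker receives its own copy of the data, so memory use grows with the number of workers. On a dual-core laptop the gain is likely to be modest at best.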

FOCUSED QUESTION: I'm using the caret package to do KNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run. How can I write this so that it runs in much less time (if that's possible)?

Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.
