Factuary

Reputation: 43

How can you improve computation time when predicting KNN Imputation?

The run time feels extremely slow for my data set. This is the code:

    library(caret)
    library(data.table)
    knnImputeValues <- preProcess(mainData[trainingRows, imputeColumns], method = c("zv", "knnImpute"))
    knnTransformed <- predict(knnImputeValues, mainData[ 1:1000, imputeColumns])

The preProcess call that builds knnImputeValues runs fairly quickly; however, the predict function takes a tremendous amount of time. When I timed it on a subset of the data, this was the result:

    testtime <- system.time(knnTransformed <- predict(knnImputeValues, mainData[ 1:15000, imputeColumns]))
    testtime

    user     969.78
    system   38.70
    elapsed  1010.72

Note that caret's preProcess uses the RANN package for the nearest-neighbor search.

Now my full dataset is:

    str(mainData[ , imputeColumns])
    'data.frame':   1809032 obs. of  16 variables:
     $ V1: int  3 5 5 4 4 4 3 4 3 3 ...
     $ V2: Factor w/ 3 levels "1000000","1500000",..: 1 1 3 1 1 1 1 3 1 1 ...
     $ V3: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
     $ V4: int  2 5 5 12 4 5 11 8 7 8 ...
     $ V5: int  2 0 0 2 0 0 1 3 2 8 ...
     $ V6: int  648 489 489 472 472 472 497 642 696 696 ...
     $ V7: Factor w/ 4 levels "","N","U","Y": 4 1 1 1 1 1 1 1 1 1 ...
     $ V8: int  0 0 0 0 0 0 0 1 1 1 ...
     $ V9: num  0 0 0 0 0 ...
     $ V10: Factor w/ 56 levels "1","2","3","4",..: 45 19 19 19 19 19 19 46 46 46 ...
     $ V11: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
     $ V12: num  2 5 5 12 4 5 11 8 7 8 ...
     $ V13: num  2 0 0 2 0 0 1 3 2 8 ...
     $ V14: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 3 3 ...
     $ V15: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...
     $ V16: num  657 756 756 756 756 ...

So is there something I'm doing wrong, or is this typical for how long this takes to run? A back-of-the-envelope extrapolation (which I know isn't entirely accurate) gives, what, 33 days?

Also, it looks like system time is very low and user time is very high. Is that normal?

My computer is a laptop with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz.

Additionally, would this improve the runtime of the predict function?

    cl <- makeCluster(4)
    registerDoParallel()

I tried it, and it didn't seem to make a difference, other than all the cores looking more active in my task manager.
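One possible reason a registered backend makes no difference: as far as I can tell, predict() on a preProcess object does not go through foreach, so the registered cluster is never consulted. Since knnImpute looks for neighbors only in the training data stored inside the preProcess object, the rows to impute can be split into chunks and predicted on separate workers. A minimal sketch with the base parallel package, assuming caret and RANN are installed; the toy data, the four chunks, and the two-worker cluster are illustrative assumptions, not settings from the question:

```r
library(parallel)

# Toy numeric stand-in for mainData[, imputeColumns] (assumption: numeric only).
set.seed(1)
m <- matrix(rnorm(3000), nrow = 300, ncol = 10)
colnames(m) <- paste0("V", 1:10)
train <- as.data.frame(m[1:200, ])             # complete rows used to fit

new_m <- m[201:300, ]
new_m[cbind(1:100, rep(1:10, times = 10))] <- NA   # one NA per row
new <- as.data.frame(new_m)

# Fit once on the master process.
pp <- caret::preProcess(train, method = c("zv", "knnImpute"))

# Split the row indices into 4 contiguous chunks.
idx <- seq_len(nrow(new))
chunks <- unname(split(idx, cut(idx, 4, labels = FALSE)))

cl <- makeCluster(2)
clusterEvalQ(cl, library(caret))               # workers need caret loaded
parts <- parLapply(
  cl, chunks,
  function(rows, pp, dat) predict(pp, dat[rows, , drop = FALSE]),
  pp = pp, dat = new
)
stopCluster(cl)

# Note: knnImpute also centers and scales, so values are standardized.
imputed <- do.call(rbind, parts)
```

This only spreads the same work over more cores, so on a 2-core laptop the best case is roughly a 2x gain; it does not change how the per-row search scales.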

FOCUSED QUESTION: I'm using the caret package to do kNN imputation on 1.8 million rows. The way I'm currently doing it will take over a month to run. How can I write this so that it runs much faster (if that is possible)?

Thank you for any help provided. The answer might very well be "that's how long it takes, don't bother"; I just want to rule out any possible mistakes.

Upvotes: 2

Views: 4894

Answers (1)

alexwhitworth

Reputation: 4907

You can speed this up with the imputation package, which can be installed from GitHub, and its use of canopies:

    Sys.setenv("PKG_CXXFLAGS"="-std=c++0x")
    devtools::install_github("alexwhitworth/imputation")

Canopies use a cheap distance metric (in this case, distance from the data mean vector) to get approximate neighbors. In general, we want to keep each canopy under 100k rows, so for 1.8M rows we'll use 20 canopies:

    library("imputation")
    to_impute <- mainData[trainingRows, imputeColumns] ## trainingRows is not defined in the question
    imputed <- kNN_impute(to_impute, k= 10, q= 2, verbose= TRUE, 
                          parallel= TRUE, n_canopies= 20)
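To make the canopy idea concrete, here is a rough base R sketch of how rows might be grouped by their distance from the mean vector. The helper assign_canopies is hypothetical and written only for illustration; it is not the imputation package's actual implementation, and the equal-sized binning by rank is an assumption:

```r
# Hypothetical helper for illustration; not a function from the imputation package.
assign_canopies <- function(x, n_canopies) {
  mu <- colMeans(x, na.rm = TRUE)          # data mean vector
  centered <- sweep(x, 2, mu)              # subtract column means
  # Root-mean-square deviation over the observed entries of each row,
  # so rows with different numbers of NAs stay comparable.
  d <- sqrt(rowMeans(centered^2, na.rm = TRUE))
  # Equal-sized bins by rank of distance: bin 1 is closest to the mean.
  ceiling(rank(d, ties.method = "first") * n_canopies / length(d))
}

set.seed(42)
x <- matrix(rnorm(2000), nrow = 200, ncol = 10)
x[cbind(1:100, rep(1:10, times = 10))] <- NA   # one NA in each of the first 100 rows

canopy <- assign_canopies(x, n_canopies = 4)
table(canopy)                                  # 50 rows in each canopy

# The expensive kNN search would then run separately inside each group.
groups <- split(seq_len(nrow(x)), canopy)
```

With 1,809,032 rows and 20 canopies, equal-sized groups would hold about 90,452 rows each, which is where the "under 100k" target comes from.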

NOTE:

The imputation package requires numeric data inputs. You have several factor variables in your str output. They will cause this to fail.
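One way to get an all-numeric input is to expand each factor into 0/1 indicator columns while keeping the NAs in place. A minimal base R sketch on made-up data; the column names only mimic the str output above, and whether indicator coding suits the distance metric is a separate judgment call:

```r
# Made-up data with one integer, one factor, and one numeric column.
df <- data.frame(
  V1 = c(3L, 5L, NA, 4L),
  V3 = factor(c("0", "1", "1", NA)),
  V9 = c(0, 0.5, NA, 1)
)

# na.pass keeps rows that contain NA; the default would silently drop them.
mf <- model.frame(~ ., data = df, na.action = na.pass)

# Expand factors into indicator columns and drop the intercept column.
x_num <- model.matrix(~ ., data = mf)[, -1, drop = FALSE]

colnames(x_num)     # "V1" "V31" "V9"
sum(is.na(x_num))   # 3: the original NAs are still there to be imputed
```

Indicator columns imputed by kNN come back as averages rather than clean 0/1 values, so they would need rounding or another rule to map back to factor levels.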

You'll also get some mean vector imputation if you have fully missing rows.

    # note this example data is too small for canopies to be useful
    # meant solely to illustrate
    set.seed(2143L)
    x1 <- matrix(rnorm(1000), 100, 10)
    x1[sample(1:1000, size= 50, replace= FALSE)] <- NA
    x_imp <- kNN_impute(x1, k=5, q=2, n_canopies= 10)
    sum(is.na(x_imp[[1]])) # 0

    # with fully missing rows
    x2 <- x1; x2[5,] <- NA
    x_imp <- kNN_impute(x2, k=5, q=2, n_canopies= 10)
    [1] "Computing canopies kNN solution provided within canopies"
    [1] "Canopies complete... calculating kNN."
    row(s) 1 are entirely missing. 
                         These row(s)' values will be imputed to column means.
    Warning message:
    In FUN(X[[i]], ...) :
      Rows with entirely missing values imputed to column means.
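If column-mean imputation is not what you want for those rows, they can be found and set aside before calling the imputation function. A small base R sketch on a made-up matrix:

```r
# Made-up 3 x 3 matrix; row 2 has no observed values.
x <- matrix(c( 1, NA,  3,
              NA, NA, NA,
               7,  8, NA), nrow = 3, byrow = TRUE)

# A row is fully missing when it has zero non-NA entries.
fully_missing <- rowSums(!is.na(x)) == 0
which(fully_missing)    # 2

# Keep only rows that have at least one observed value.
x_to_impute <- x[!fully_missing, , drop = FALSE]
```

The rows set aside can then be handled explicitly, for example dropped or flagged, instead of being filled with column means.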

Upvotes: 1
