K_D

Reputation: 147

Speeding up subsetting of data.table and running thousands of regressions

I have a data.table with 100 rows and 3 columns. The rows are grouped into 30 groups. The three columns are my independent variables.

During each iteration, I randomly pick one row from each group and create a subset containing 30 rows.

I then join the subset to another data.table containing my dependent variable.

There are several thousand possible combinations. I tried to speed up the code using foreach, as shown below. I have run it for 1000 iterations so far and it seemed to help, but since I will have to execute several thousand more combinations, I am wondering whether there are ways to make it more efficient or faster.

library(data.table)
library(parallel)
library(foreach)
library(doParallel)

#data.table containing all independent variables
ids <- vector()
#my experiment results in multiple rows per group; creating such repetitive
#group ids was surprisingly not very straightforward
for(i in 1:100){ids[i] <- sample(1:30, 1)}
ids <- sort(ids)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
dd1 <- data.table(ids,x1,x2,x3)

#data.table containing all dependent values
ids <- 1:30
y <- rnorm(30)
dd2 <- data.table(ids,y)

clus <- makeCluster(detectCores() - 1)
registerDoParallel(clus, cores = detectCores() - 1)


out <- foreach(i = 1:1000, .packages=c("dplyr", "data.table", "caret"), .combine='c') %dopar% {
  dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
  dd3 <- right_join(dd2, dd3, by="ids")

  model <- train(y~x1+x2+x3,
                 data = dd3,
                 method = "lm",
                 trControl = trainControl(method="LOOCV"))
  list(model$results$RMSE,
       model$results$Rsquared,
       model$results$MAE)
}
stopCluster(clus)

I have only recently started getting used to the data.table syntax, and I find it easier to fall back on some dplyr functions to save time, so there may be a few inconsistencies. I look forward to any suggestions for improvement.
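For reference, the dplyr right_join step inside the loop could also be written with data.table's own join syntax. A minimal sketch using the same dd1, dd2, dd3 objects and the ids key from the code above (not the code I actually ran):

#pure data.table version of the per-iteration subset and join (sketch)
dd3 <- dd1[, .SD[sample(.N, 1)], by = ids]
dd3 <- dd2[dd3, on = "ids"]   #X[Y] join: keeps every row of dd3, adds y from dd2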

Thank you

Upvotes: 0

Views: 81

Answers (1)

mt1022

Reputation: 17299

As the benchmarks below show, the rate-limiting step is model training: even though the data.table subsetting time is reduced by roughly 87%, the overall run time is almost unchanged.

library(data.table)
library(dplyr)
library(caret)
library(microbenchmark)

microbenchmark(
    a = {
        dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
        dd3 <- right_join(dd2, dd3, by="ids")
    },
    b = {
        dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
        dd3[, y := dd2$y]
    }, times = 10)
# Unit: microseconds
#  expr      min       lq      mean    median       uq      max neval
#     a 5151.775 5178.159 5248.1007 5214.2990 5260.367 5517.200    10
#     b  661.024  671.663  729.1066  699.2115  744.380  988.915    10

microbenchmark(
    a = {
        dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
        dd3 <- right_join(dd2, dd3, by="ids")
        model <- train(y~x1+x2+x3, data = dd3, method = "lm", trControl = trainControl(method="LOOCV"))
        list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
    },
    b = {
        dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
        dd3[, y := dd2$y]
        model <- train(y~x1+x2+x3, data = dd3, method = "lm", trControl = trainControl(method="LOOCV"))
        list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
    }, times = 10)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#     a 450.1885 451.4723 454.9538 452.6399 459.6504 463.7085    10
#     b 445.2466 446.8068 449.4441 447.1629 450.0173 460.8545    10
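Since the train() call dominates, any real speed-up would have to come from the modelling step itself. For an ordinary linear model, the LOOCV residuals can be obtained analytically from a single lm() fit via the leverage values (residual / (1 - hat value)), which avoids refitting the model once per held-out row. A minimal sketch of that idea (not benchmarked above; the Rsquared here follows caret's cor(pred, obs)^2 convention):

#sketch: analytic LOOCV statistics for a linear model, no caret refitting
fit <- lm(y ~ x1 + x2 + x3, data = dd3)
h   <- hatvalues(fit)                 #leverage of each observation
e   <- residuals(fit) / (1 - h)       #leave-one-out residuals
pred_loo <- dd3$y - e                 #held-out predictions
list(RMSE     = sqrt(mean(e^2)),
     Rsquared = cor(pred_loo, dd3$y)^2,
     MAE      = mean(abs(e)))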

Upvotes: 2
