K_D

Reputation: 147

Speeding up subsetting of data.table and running thousands of regressions

I have a data.table with 100 rows and 3 columns. The rows are grouped into 30 groups. The three columns are my independent variables.

During each iteration, I randomly pick one row from each group and create a subset containing 30 rows.

I then join the subset to another data.table containing my dependent variable.

There are several thousand possible combinations. I tried to speed up the code using foreach, as shown below. I have run it for 1000 iterations so far and it seemed to help, but since I will have to execute several thousand more combinations, I am wondering whether there are ways to make it more efficient or faster.

library(data.table)
library(parallel)
library(foreach)
library(doParallel)

#data.table containing all independent variables
ids <- vector()
#my experiment results in multiple rows per group; creating such repetitive
#group ids was surprisingly not very straightforward
for(i in 1:100){ids[i] <- sample(1:30, 1)}
ids <- sort(ids)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
dd1 <- data.table(ids,x1,x2,x3)

#data.table containing all dependent values
ids <- 1:30
y <- rnorm(30)
dd2 <- data.table(ids,y)

clus <- makeCluster(detectCores() - 1)
registerDoParallel(clus, cores = detectCores() - 1)


out <- foreach(i = 1:1000, .packages=c("dplyr", "data.table", "caret"), .combine='c') %dopar% {
  dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
  dd3 <- right_join(dd2, dd3, by="ids")

  model <- train(y~x1+x2+x3,
                 data = dd3,
                 method = "lm",
                 trControl = trainControl(method="LOOCV"))
  list(model$results$RMSE,
       model$results$Rsquared,
       model$results$MAE)
}
stopCluster(clus)

I have only recently started getting used to the data.table syntax, and I find it easier to fall back on some dplyr functions to save time, so there may be a few inconsistencies. I look forward to any suggestions for improvement.
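For reference, the dplyr right_join step inside the loop could also be written with data.table's own join syntax. A minimal sketch using the same dd1, dd2, dd3 objects and the ids key from the code above (not the code I actually ran):

#pure data.table version of the per-iteration subset and join (sketch)
dd3 <- dd1[, .SD[sample(.N, 1)], by = ids]
dd3 <- dd2[dd3, on = "ids"]   #X[Y] join: keeps every row of dd3, adds y from dd2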

Thank you

Upvotes: 0

Views: 81

Answers (1)

mt1022

Reputation: 17299

As the benchmarks below show, the rate-limiting step is model training: even though the data.table subsetting time is reduced by roughly 87%, the overall run time is almost unchanged.

library(data.table)
library(dplyr)
library(caret)
library(microbenchmark)

microbenchmark(
    a = {
        dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
        dd3 <- right_join(dd2, dd3, by="ids")
    },
    b = {
        dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
        dd3[, y := dd2$y]
    }, times = 10)
# Unit: microseconds
#  expr      min       lq      mean    median       uq      max neval
#     a 5151.775 5178.159 5248.1007 5214.2990 5260.367 5517.200    10
#     b  661.024  671.663  729.1066  699.2115  744.380  988.915    10

microbenchmark(
    a = {
        dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
        dd3 <- right_join(dd2, dd3, by="ids")
        model <- train(y~x1+x2+x3, data = dd3, method = "lm", trControl = trainControl(method="LOOCV"))
        list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
    },
    b = {
        dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
        dd3[, y := dd2$y]
        model <- train(y~x1+x2+x3, data = dd3, method = "lm", trControl = trainControl(method="LOOCV"))
        list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
    }, times = 10)
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#     a 450.1885 451.4723 454.9538 452.6399 459.6504 463.7085    10
#     b 445.2466 446.8068 449.4441 447.1629 450.0173 460.8545    10
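Since the train() call dominates, any real speed-up would have to come from the modelling step itself. For an ordinary linear model, the LOOCV residuals can be obtained analytically from a single lm() fit via the leverage values (residual / (1 - hat value)), which avoids refitting the model once per held-out row. A minimal sketch of that idea (not benchmarked above; the Rsquared here follows caret's cor(pred, obs)^2 convention):

#sketch: analytic LOOCV statistics for a linear model, no caret refitting
fit <- lm(y ~ x1 + x2 + x3, data = dd3)
h   <- hatvalues(fit)                 #leverage of each observation
e   <- residuals(fit) / (1 - h)       #leave-one-out residuals
pred_loo <- dd3$y - e                 #held-out predictions
list(RMSE     = sqrt(mean(e^2)),
     Rsquared = cor(pred_loo, dd3$y)^2,
     MAE      = mean(abs(e)))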

Upvotes: 2
