Reputation: 147
I have a data.table with 100 rows and 3 columns. The rows are grouped into 30 groups, and the three columns are my independent variables.
During each iteration, I randomly pick one row from each group and create a subset containing 30 rows.
I then join the subset to another data.table containing my dependent variable.
There are several thousand possible combinations. I tried to speed up the code using foreach, as shown below. I have run 1000 iterations so far and it seems to have helped, but since I will have to run several thousand more combinations, I am wondering whether there are ways to make this more efficient or faster.
library(data.table)
library(parallel)
library(foreach)
library(doParallel)
# data.table containing all independent variables
ids <- vector()
# my experiment results in multiple rows per group; creating such repetitive
# group ids was surprisingly not very straightforward
for (i in 1:100) ids[i] <- sample(1:30, 1)
ids <- sort(ids)
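# equivalent one-liner, for reference: ids <- sort(sample(1:30, 100, replace = TRUE))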
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
dd1 <- data.table(ids,x1,x2,x3)
# data.table containing all dependent values
ids <- 1:30
y <- rnorm(30)
dd2 <- data.table(ids,y)
clus <- makeCluster(detectCores() - 1)
registerDoParallel(clus, cores = detectCores() - 1)
out <- foreach(i = 1:1000, .packages=c("dplyr", "data.table", "caret"), .combine='c') %dopar% {
dd3 <- dd1[, .SD[sample(.N, min(1,.N))], by = ids]
dd3 <- right_join(dd2, dd3, by="ids")
model <- train(y~x1+x2+x3,
data = dd3,
method = "lm",
trControl = trainControl(method="LOOCV"))
list(model$results$RMSE,
model$results$Rsquared,
model$results$MAE)
}
stopCluster(clus)
I have recently started getting used to the syntax of data.table. I find it easier to depend on some of the dplyr functions to save time, so there may be a few inconsistencies. I look forward to any suggestions for improvement. Thank you.
Upvotes: 0
Views: 81
Reputation: 17299
As shown in the benchmark below, the rate-limiting step is model training. Even if the data.table subsetting time is reduced by 87%, the overall run time is almost the same.
library(data.table)
library(dplyr)
library(caret)
library(microbenchmark)
microbenchmark(
  a = {
    dd3 <- dd1[, .SD[sample(.N, min(1, .N))], by = ids]
    dd3 <- right_join(dd2, dd3, by = "ids")
  },
  b = {
    dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
    dd3[, y := dd2$y]
  }, times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval
# a 5151.775 5178.159 5248.1007 5214.2990 5260.367 5517.200 10
# b 661.024 671.663 729.1066 699.2115 744.380 988.915 10
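# In "b", shuffling all rows and keeping the first occurrence of each id picks one
# random row per group; assigning dd2$y by position assumes every id 1:30 occurs in
# dd1, so that dd3 has exactly 30 rows in id order.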
microbenchmark(
  a = {
    dd3 <- dd1[, .SD[sample(.N, min(1, .N))], by = ids]
    dd3 <- right_join(dd2, dd3, by = "ids")
    model <- train(y ~ x1 + x2 + x3, data = dd3, method = "lm", trControl = trainControl(method = "LOOCV"))
    list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
  },
  b = {
    dd3 <- dd1[sample.int(nrow(dd1))][order(ids)][!duplicated(ids)]
    dd3[, y := dd2$y]
    model <- train(y ~ x1 + x2 + x3, data = dd3, method = "lm", trControl = trainControl(method = "LOOCV"))
    list(model$results$RMSE, model$results$Rsquared, model$results$MAE)
  }, times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# a 450.1885 451.4723 454.9538 452.6399 459.6504 463.7085 10
# b 445.2466 446.8068 449.4441 447.1629 450.0173 460.8545 10
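Since train() dominates the run time, any further speed-up would have to come from the modelling step itself. As a minimal sketch (not benchmarked above, and assuming only the LOOCV RMSE, Rsquared, and MAE of a plain linear model are needed), these statistics can be computed from the hat values of a single lm() fit instead of refitting through caret:
# sketch: LOOCV statistics for an OLS fit, derived from the hat values
loocv_lm <- function(dat) {
  fit  <- lm(y ~ x1 + x2 + x3, data = dat)
  h    <- hatvalues(fit)            # leverages
  res  <- residuals(fit) / (1 - h)  # leave-one-out residuals
  pred <- dat$y - res               # leave-one-out predictions
  c(RMSE     = sqrt(mean(res^2)),
    Rsquared = cor(pred, dat$y)^2,
    MAE      = mean(abs(res)))
}
loocv_lm(dd3)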
Upvotes: 2