What would be the best way to improve calcul performance in a big data.table?

Question

In a single data.table, I have many calculs to perform. Simple, but combining many configurations : creating X variables from Y others, making groups based on X different variables, etc...

Step by step, I manage to perform all the calculations I need (with my knowledge in data.table), but my real challenge is called PERFORMANCE. My data.table contains millions of lines, and the calculations are made on dozens and dozens of columns.

What I would like to know :

Is there a better way to write this code to improve performance ?
Some of my options do not work (1.3 and 2.2, tag with KO) : good approach ? How to write it ?
My microbenchmark seems to show me that the best option depends on the number of lines ? Right ?

Here is my code with a reprex :

library(data.table)
library(stringr)
library(microbenchmark)

n.row <- 1e5

foo <- data.table(id = 101:(101+n.row-1), 
                  crit = rep(c('fr', 'ca', 'al', 'se', 'is'), 5),
                  val_1 = round(runif(n.row, 0.5, 50), digits = 2),
                  val_2 = round(runif(n.row, 1, 20), digits = 0),
                  val_3 = round(runif(n.row, 1, 5), digits = 0),
                  eff = 28500,
                  num = sample(0:1,n.row, replace = TRUE),
                  num_2 = round(runif(n.row, 1, 10), digits = 1),
                  num_17 = round(runif(n.row, 1, 10), digits = 1),
                  num_69 = round(runif(n.row, 0, 1), digits = 2),
                  num_5 = round(runif(n.row, 10, 20), digits = 0),
                  cof = round(runif(n.row, 0.1, 2), digits = 5),
                  ToDo = rep(1, n.row),
                  grp_1 = sample(LETTERS[c(1,3)], n.row, replace = TRUE))


foo[, c("grp_2", "grp_3") := {
  grp_2 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(5,8,9)], n.row, replace = TRUE),
                grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(14,16)], n.row, replace = TRUE))
  
  grp_3 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(20:23)], n.row, replace = TRUE),
                grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(24:26)], n.row, replace = TRUE))
  
  list(grp_2, grp_3)
}]

# Calcul sd and qa
foo[, sd := (val_1 * cof)]
foo[num == 1, qa := (val_2 * cof)]
foo[num != 1, qa := (val_3 * cof)]

foo1 <- copy(foo)
foo2 <- copy(foo)
foo3 <- copy(foo)

# calcul of qa_X 
var.calc <- names(foo)[str_which(names(foo), "^num.\d+$")]

# 1.1
for (j in var.calc){
  foo1[, paste0("qa_", str_extract(j, "\d+$")) := qa * get(j)]
}

# 1.2
setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ]

# 1.3 KO
for (j in var.calc){ set(foo3, paste0("qa_", str_extract(j, "\d+$")) := qa * get(j)) }

# comparaison
mbm <- microbenchmark(
  Test.for = for (j in var.calc){ foo1[, paste0("qa_", str_extract(j, "\d+$")) := qa * get(j)] },
  Test.set = setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ],
  times = 10
)

mbm

# calcul by groups
var.grp <- names(foo)[grepl("^grp.\d+$", names(foo))]

# 2.1
for (j in var.grp) {
  foo1[, paste0("s.sd.", j) := sum(sd, na.rm = TRUE), by = get(j)]
  foo1[, paste0("s.qa.", j) := sum(qa, na.rm = TRUE), by = get(j)]
}

# 2.2 KO 
setDT(foo2)[, paste0("s.sd.", var.grp) := lapply(.SD, function(x) sum(x)), .SDcols = var.calc, by = .SD ]

Many thanks for helping or suggestions.

(If I have to split my request, I will).

What would be the best way to improve calcul performance in a big data.table?

Answers (1)

Related Questions