Discus23
Discus23

Reputation: 521

What would be the best way to improve calcul performance in a big data.table?

In a single data.table, I have many calculs to perform. Simple, but combining many configurations : creating X variables from Y others, making groups based on X different variables, etc...

Step by step, I manage to perform all the calculations I need (with my knowledge in data.table), but my real challenge is called PERFORMANCE. My data.table contains millions of lines, and the calculations are made on dozens and dozens of columns.

What I would like to know :

Here is my code with a reprex :

library(data.table)
library(stringr)
library(microbenchmark)

n.row <- 1e5

foo <- data.table(id = 101:(101+n.row-1), 
                  crit = rep(c('fr', 'ca', 'al', 'se', 'is'), 5),
                  val_1 = round(runif(n.row, 0.5, 50), digits = 2),
                  val_2 = round(runif(n.row, 1, 20), digits = 0),
                  val_3 = round(runif(n.row, 1, 5), digits = 0),
                  eff = 28500,
                  num = sample(0:1,n.row, replace = TRUE),
                  num_2 = round(runif(n.row, 1, 10), digits = 1),
                  num_17 = round(runif(n.row, 1, 10), digits = 1),
                  num_69 = round(runif(n.row, 0, 1), digits = 2),
                  num_5 = round(runif(n.row, 10, 20), digits = 0),
                  cof = round(runif(n.row, 0.1, 2), digits = 5),
                  ToDo = rep(1, n.row),
                  grp_1 = sample(LETTERS[c(1,3)], n.row, replace = TRUE))


foo[, c("grp_2", "grp_3") := {
  grp_2 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(5,8,9)], n.row, replace = TRUE),
                grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(14,16)], n.row, replace = TRUE))
  
  grp_3 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(20:23)], n.row, replace = TRUE),
                grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(24:26)], n.row, replace = TRUE))
  
  list(grp_2, grp_3)
}]

# Calcul sd and qa
foo[, sd := (val_1 * cof)]
foo[num == 1, qa := (val_2 * cof)]
foo[num != 1, qa := (val_3 * cof)]

foo1 <- copy(foo)
foo2 <- copy(foo)
foo3 <- copy(foo)

# calcul of qa_X 
var.calc <- names(foo)[str_which(names(foo), "^num.\\d+$")]

# 1.1
for (j in var.calc){
  foo1[, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)]
}

# 1.2
setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ]

# 1.3 KO
for (j in var.calc){ set(foo3, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)) }

# comparaison
mbm <- microbenchmark(
  Test.for = for (j in var.calc){ foo1[, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)] },
  Test.set = setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ],
  times = 10
)

mbm

# calcul by groups
var.grp <- names(foo)[grepl("^grp.\\d+$", names(foo))]

# 2.1
for (j in var.grp) {
  foo1[, paste0("s.sd.", j) := sum(sd, na.rm = TRUE), by = get(j)]
  foo1[, paste0("s.qa.", j) := sum(qa, na.rm = TRUE), by = get(j)]
}

# 2.2 KO 
setDT(foo2)[, paste0("s.sd.", var.grp) := lapply(.SD, function(x) sum(x)), .SDcols = var.calc, by = .SD ]

Many thanks for helping or suggestions.

(If I have to split my request, I will).

Upvotes: 0

Views: 59

Answers (1)

minem
minem

Reputation: 3650

  1. question: I would use:

    for (j in var.calc) set(foo3, j = paste0("qa_", str_extract(j, "\\d+$")), value = foo3$qa * foo3[[j]])
    

(fixed 1.3 example)

  1. question: 2.1 seams fine

Notes:

  • you don't need to constantly use setDT(foo2)
  • read documentation of data.table! there are lots of useful example, etc.: https://rdatatable.gitlab.io/data.table/
  • don't look at microbenchmark's, try the code on your real data and time that, because the results(time) will be different and the overhead, that some of data.tables functions have, will be insignificant.

Upvotes: 2

Related Questions