Reputation: 521
In a single data.table
, I have many calculs to perform. Simple, but combining many configurations : creating X variables from Y others, making groups based on X different variables, etc...
Step by step, I manage to perform all the calculations I need (with my knowledge in data.table), but my real challenge is called PERFORMANCE. My data.table
contains millions of lines, and the calculations are made on dozens and dozens of columns.
What I would like to know :
microbenchmark
seems to show me that the best option depends on the number of lines ? Right ?Here is my code with a reprex :
library(data.table)
library(stringr)
library(microbenchmark)
n.row <- 1e5
foo <- data.table(id = 101:(101+n.row-1),
crit = rep(c('fr', 'ca', 'al', 'se', 'is'), 5),
val_1 = round(runif(n.row, 0.5, 50), digits = 2),
val_2 = round(runif(n.row, 1, 20), digits = 0),
val_3 = round(runif(n.row, 1, 5), digits = 0),
eff = 28500,
num = sample(0:1,n.row, replace = TRUE),
num_2 = round(runif(n.row, 1, 10), digits = 1),
num_17 = round(runif(n.row, 1, 10), digits = 1),
num_69 = round(runif(n.row, 0, 1), digits = 2),
num_5 = round(runif(n.row, 10, 20), digits = 0),
cof = round(runif(n.row, 0.1, 2), digits = 5),
ToDo = rep(1, n.row),
grp_1 = sample(LETTERS[c(1,3)], n.row, replace = TRUE))
foo[, c("grp_2", "grp_3") := {
grp_2 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(5,8,9)], n.row, replace = TRUE),
grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(14,16)], n.row, replace = TRUE))
grp_3 = fcase(grp_1 %in% LETTERS[c(1)], sample(LETTERS[c(20:23)], n.row, replace = TRUE),
grp_1 %in% LETTERS[c(3)], sample(LETTERS[c(24:26)], n.row, replace = TRUE))
list(grp_2, grp_3)
}]
# Calcul sd and qa
foo[, sd := (val_1 * cof)]
foo[num == 1, qa := (val_2 * cof)]
foo[num != 1, qa := (val_3 * cof)]
foo1 <- copy(foo)
foo2 <- copy(foo)
foo3 <- copy(foo)
# calcul of qa_X
var.calc <- names(foo)[str_which(names(foo), "^num.\\d+$")]
# 1.1
for (j in var.calc){
foo1[, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)]
}
# 1.2
setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ]
# 1.3 KO
for (j in var.calc){ set(foo3, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)) }
# comparaison
mbm <- microbenchmark(
Test.for = for (j in var.calc){ foo1[, paste0("qa_", str_extract(j, "\\d+$")) := qa * get(j)] },
Test.set = setDT(foo2)[, paste0("qa_", str_extract(var.calc, "\\d+$")) := lapply(.SD, function(x) x * qa), .SDcols = var.calc ],
times = 10
)
mbm
# calcul by groups
var.grp <- names(foo)[grepl("^grp.\\d+$", names(foo))]
# 2.1
for (j in var.grp) {
foo1[, paste0("s.sd.", j) := sum(sd, na.rm = TRUE), by = get(j)]
foo1[, paste0("s.qa.", j) := sum(qa, na.rm = TRUE), by = get(j)]
}
# 2.2 KO
setDT(foo2)[, paste0("s.sd.", var.grp) := lapply(.SD, function(x) sum(x)), .SDcols = var.calc, by = .SD ]
Many thanks for helping or suggestions.
(If I have to split my request, I will).
Upvotes: 0
Views: 59
Reputation: 3650
question: I would use:
for (j in var.calc) set(foo3, j = paste0("qa_", str_extract(j, "\\d+$")), value = foo3$qa * foo3[[j]])
(fixed 1.3 example)
Notes:
setDT(foo2)
data.table
! there are lots of useful example, etc.: https://rdatatable.gitlab.io/data.table/microbenchmark
's, try the code on your real data and time that, because the results(time) will be different and the overhead, that some of data.table
s functions have, will be insignificant.Upvotes: 2