Reputation: 691
When iterating through all columns of an R data.table using reference semantics, which of the following makes more sense from a memory-usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: in (2), I am forcing data.table to overwrite dt column by column, so I would assume it needs extra memory on the order of a single column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
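For what it's worth, a rough way to compare the peak memory of the two variants is to reset the garbage collector's counters before each run and look at the "max used" column afterwards. This is only a coarse indicator, and the larger toy table below (1e7 rows, my own choice rather than part of the question) is there just so the difference is visible:
library(data.table)
my_fun <- function(x) x + 1

# Variant (1): assign all columns in a single := call
dt <- data.table(a = rnorm(1e7), b = rnorm(1e7))
all_cols <- colnames(dt)
gc(reset = TRUE)                          # reset the "max used" counters
dt[, (all_cols) := lapply(.SD, my_fun)]
gc()                                      # inspect the "max used" column

# Variant (2): overwrite one column at a time
dt <- data.table(a = rnorm(1e7), b = rnorm(1e7))
gc(reset = TRUE)
lapply(all_cols, function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
gc()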
Upvotes: 1
Views: 161
Reputation: 691
Following the suggestion of @Frank, the most memory-efficient way to update a data.table column by column, applying a function my_fun to each column, is
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This is currently (v1.11.4) not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which is internally optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
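As a side note (my own addition, not part of the original answer): one way to see whether this optimisation kicks in is the verbose argument of [.data.table, which prints how j was rewritten. A minimal sketch:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1

# Plain lapply(.SD, ...) in j: the verbose output should report the
# lapply() -> list() optimisation described in ?datatable.optimize.
dt[, lapply(.SD, my_fun), verbose = TRUE]

# The := variant from the question is, per the answer above, not
# optimised the same way as of v1.11.4.
dt[, (colnames(dt)) := lapply(.SD, my_fun), verbose = TRUE]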
Upvotes: 2