nbenn

Reputation: 691

data.table reference semantics: memory usage of iterating through all columns

When iterating through all columns of an R data.table using reference semantics, which makes more sense from a memory-usage standpoint:

(1) dt[, (all_cols) := lapply(.SD, my_fun)]

or

(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]

My question is: in (2), I am forcing data.table to overwrite dt on a column-by-column basis, so I assume the extra memory needed is only on the order of a single column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?

Some sample code to run the above variants:

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
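
One way I tried to check the peak-memory behaviour of the two variants (I am not sure this is the best tool, but it only uses base R) is to reset gc()'s "max used" counters before each run and read them afterwards; a rough sketch, with arbitrarily chosen larger columns so any difference is visible:

n <- 1e7  # arbitrary size, roughly 80 MB per double column
dt <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
my_fun <- function(x) x + 1
all_cols <- colnames(dt)

gc(reset = TRUE)                          # reset the "max used" counters
dt[, (all_cols) := lapply(.SD, my_fun)]   # variant (1)
gc()                                      # inspect the "max used (Mb)" column

dt <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
gc(reset = TRUE)
lapply(all_cols, function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]  # variant (2)
gc()                                      # compare the peak with the run above

If (1) really evaluates all of lapply(.SD, my_fun) before assigning, its peak should exceed that of (2) by roughly the combined size of the remaining columns.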

Upvotes: 1

Views: 161

Answers (1)

nbenn

Reputation: 691

Following the suggestion of @Frank, the most memory-efficient way to update a data.table column by column, applying a function my_fun to each column, is

library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)

for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))

This is currently (v1.11.4) not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which internally is optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
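
A quick way to see that optimisation at work (continuing from the snippet above; the exact wording of the output may differ between versions) is to switch on data.table's verbose mode:

options(datatable.verbose = TRUE)
dt[, lapply(.SD, my_fun)]  # verbose output should include a line along the lines of
                           # "lapply optimization changed j from lapply(.SD, my_fun)
                           #  to list(my_fun(a), my_fun(b))"
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))  # set() skips [.data.table,
                                                                   # so there is no j to optimise
options(datatable.verbose = FALSE)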

Upvotes: 2
