Abiel

Reputation: 5455

Programmatically generating a list of columns to be assigned to data.table with `:=` syntax

In data.table, I can generate a list of new columns that are immediately assigned to the table using the `:=` syntax, like so:

x <- data.table(x1=1:5, x2=1:5)
x[, `:=` (x3=x1+2, x4=x2*3)]

Alternatively, I could have done the following:

x[, c("x3","x4") := list(x1+2, x2*3)]

I would like to do something like the first method, but with the right-hand side of the assignment built automatically by a custom function. For example, suppose I want a function that accepts a set of columns and generates new columns holding the mean of each, with each new column named after the original column plus a suffix. With such a function, a call like

x[, `:=` MEAN(x1,x2)]

would yield the same result as

x[, `:=` (x1_mean=mean(x1), x2_mean=mean(x2))]

Is this possible in data.table? I realize this is possible if I'm willing to pass in a list of column names like in the c("x3","x4") := ... example, but I want to avoid this so I don't have to write as much code.

Upvotes: 5

Views: 179

Answers (1)

Frank

Reputation: 66819

Just refer to the function by name, as a string (starting here from the original two-column x):

myfun <- "mean"
x[,paste(names(x),myfun,sep="_"):=lapply(.SD,myfun)]
#    x1 x2 x1_mean x2_mean
# 1:  1  1       3       3
# 2:  2  2       3       3
# 3:  3  3       3       3
# 4:  4  4       3       3
# 5:  5  5       3       3

Customization is straightforward:

divby2 <- function(x) x/2 # custom function
myfun  <- "divby2"
mycols <- "x1"            # custom columns
x[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
#    x1 x2 x1_mean x2_mean x1_divby2
# 1:  1  1       3       3       0.5
# 2:  2  2       3       3       1.0
# 3:  3  3       3       3       1.5
# 4:  4  4       3       3       2.0
# 5:  5  5       3       3       2.5

We may some day have syntax like paste(.SDcols,myfun,sep="_"):=lapply(.SD,myfun), but .SDcols on the left-hand side is not supported currently.


Making a function. If you want a function to do this, here is one:

add_myfun <- function(DT,myfun,mycols){
  DT[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
}
add_myfun(x,"median","x2")
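As a quick check, here is a minimal self-contained run of the wrapper on several columns at once. The table values are made up for illustration; the point is only that the `:=` inside the function modifies the caller's table by reference, so no reassignment is needed.

```r
library(data.table)

# Wrapper from the answer, unchanged apart from spacing
add_myfun <- function(DT, myfun, mycols) {
  DT[, paste(mycols, myfun, sep = "_") := lapply(.SD, myfun), .SDcols = mycols]
}

# Illustrative data (assumed, not from the question)
x <- data.table(x1 = 1:5, x2 = c(2, 4, 6, 8, 10))

# One call adds a median column for each requested column
add_myfun(x, "median", c("x1", "x2"))

names(x)
```

Note that myfun has to be a character string here, because it is pasted into the new column names as well as passed to lapply.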

Can a function be written that will work inside j of DT[i,j]? Maybe. But I think it's not a good idea.

  1. Can you be sure your function will be robust to all the other uses of j, like by?
  2. Can your function take advantage of data.table's optimization (e.g., of mean)?
  3. Will anyone else be able to read your code?
  4. Each call to [ on a data.table carries overhead, so it can be slow when repeated. If you're doing this for many columns, you might be better off looping over them and assigning each new column with set.
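The last point can be sketched as follows. This is only an illustration under assumptions: add_myfun_set is a hypothetical name, not something from data.table, and it loops over the columns with set() instead of going through [ with `:=`.

```r
library(data.table)

# Hypothetical helper: same idea as add_myfun, but assigns with set()
add_myfun_set <- function(DT, myfun, mycols) {
  f <- match.fun(myfun)            # look up the function from its name
  for (col in mycols) {
    newcol <- paste(col, myfun, sep = "_")
    # set() assigns by reference; a length-1 result is recycled to all rows
    set(DT, j = newcol, value = f(DT[[col]]))
  }
  invisible(DT)
}

x <- data.table(x1 = 1:5, x2 = 1:5)
add_myfun_set(x, "mean", c("x1", "x2"))

names(x)
```

Unlike the `:=` version, this sketch has no support for by; it only covers the ungrouped case.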

Upvotes: 2
