Reputation: 5455
In `data.table`, I can generate a list of new columns that are immediately assigned to the table using the `:=` syntax, like so:
x <- data.table(x1=1:5, x2=1:5)
x[, `:=` (x3=x1+2, x4=x2*3)]
Alternatively, I could have done the following:
x[, c("x3","x4") := list(x1+2, x2*3)]
I would like to do something like the first method, but have the right hand side of the assignment statement be built up automatically using a custom function. For example, suppose I want a function that will accept a set of column names, then generate new columns that are the mean of the given columns, with the column name being equal to the original column plus some suffix. For example,
x[, `:=` MEAN(x1,x2)]
would yield the same result as
x[, `:=` (x1_mean=mean(x1), x2_mean=mean(x2))]
Is this possible in `data.table`? I realize this is possible if I'm willing to pass in a list of column names like in the `c("x3","x4") := ...` example, but I want to avoid this so I don't have to write as much code.
Upvotes: 5
Views: 179
Reputation: 66819
Just refer to the function by name:
myfun <- "mean"
x[,paste(names(x),myfun,sep="_"):=lapply(.SD,myfun)]
# x1 x2 x1_mean x2_mean
# 1: 1 1 3 3
# 2: 2 2 3 3
# 3: 3 3 3 3
# 4: 4 4 3 3
# 5: 5 5 3 3
Customization is straightforward:
divby2 <- function(x) x/2 # custom function
myfun <- "divby2"
mycols <- "x1" # custom columns
x[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
# x1 x2 x1_mean x2_mean x1_divby2
# 1: 1 1 3 3 0.5
# 2: 2 2 3 3 1.0
# 3: 3 3 3 3 1.5
# 4: 4 4 3 3 2.0
# 5: 5 5 3 3 2.5
We may some day have syntax like `paste(.SDcols,myfun,sep="_"):=lapply(.SD,myfun)`, but `.SDcols` on the left-hand side is not supported currently.
Making a function. If you want a function to do this, there's
add_myfun <- function(DT,myfun,mycols){
  DT[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
}
add_myfun(x,"median","x2")
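As a quick check of the wrapper above, here is a minimal self-contained sketch. It assumes a fresh copy of the example table (the name `y` is only for illustration). Because `:=` assigns by reference, the call modifies the table that is passed in, so no reassignment is needed.

```r
library(data.table)

# Same wrapper as above: new column names are built from the
# selected column names plus the function name.
add_myfun <- function(DT, myfun, mycols) {
  DT[, paste(mycols, myfun, sep = "_") := lapply(.SD, myfun), .SDcols = mycols]
}

y <- data.table(x1 = 1:5, x2 = 1:5)   # hypothetical fresh table
add_myfun(y, "median", "x2")          # adds x2_median by reference
add_myfun(y, "mean", c("x1", "x2"))   # adds x1_mean and x2_mean

names(y)
# "x1" "x2" "x2_median" "x1_mean" "x2_mean"
```

Passing the function as a string is what lets the same value serve both as the column suffix and as the function that `lapply` looks up.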
Can a function be written that will work inside `j` of `DT[i,j]`? Maybe. But I think it's not a good idea. How would it interact with arguments other than `j`, like `by`? Would it still benefit from `data.table`'s optimization (e.g., of `mean`)?
`[` can be slow. If you're doing this for many columns, you might be better off initializing the new columns and assigning with `set`.
Upvotes: 2
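The `set` suggestion in the answer above could look roughly like the following. This is a sketch, not the answerer's code: the names `DT`, `newcols`, and `f` are illustrative, and initializing with `NA_real_` assumes the chosen function returns numeric values.

```r
library(data.table)

DT <- data.table(x1 = 1:5, x2 = 1:5)   # hypothetical example table
myfun   <- "mean"
mycols  <- c("x1", "x2")
newcols <- paste(mycols, myfun, sep = "_")
f <- match.fun(myfun)                  # resolve the function from its name

# Initialize the new columns once (assumes a numeric result).
DT[, (newcols) := NA_real_]

# Fill each new column with set(), avoiding repeated calls to `[`.
for (k in seq_along(mycols)) {
  set(DT, j = newcols[k], value = f(DT[[mycols[k]]]))
}
```

With only two columns the difference is negligible; the loop is intended for cases with many columns, where the per-call overhead of `[` adds up.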