sds
sds

Reputation: 60004

data.table operations by column name

Suppose I have a data.table

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

I can add new columns like this:

a[, sa := sum(a), by="id"]
a[, sb := sum(b), by="id"]
> a
   id  a  b sa sb
1:  1 21 11 43 23
2:  1 22 12 43 23
3:  2 23 13 47 27
4:  2 24 14 47 27
5:  3 25 15 25 15

However, suppose that I have column names instead:

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, s := sum(n), by="id", with=FALSE] # ERROR: invalid 'type' (character) of argument
}

what do I do?

Upvotes: 12

Views: 2642

Answers (4)

Beasterfield
Beasterfield

Reputation: 7113

Edit 2020-02-15 about ..

data.table also supports the .. syntax to "look up a level", obviating the need for with=FALSE in most cases, e.g. dt[ , ..n1] and dt[ , ..n2] in the below


have a look at with in ? data.table:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[, n3 := dt[ , n1, with = FALSE ] * dt[ , n2, with = FALSE ], with = FALSE ]

EDIT:

Or you just change the colnames forth and back:

dt <- data.table(id=1:5,a=21:25,b=11:15,key="id")
dt[ , dt.names["n3"] := 1L, with = FALSE ]

dt.names <- c( n1 = "a", n2 = "b", n3 = "c" )
setnames( dt, dt.names, names(dt.names) )

dt[ , n3 := n1 * n2, by = "id" ]
setnames( dt, names(dt.names), dt.names )

which works together with by.

Upvotes: 2

mnel
mnel

Reputation: 115382

Here is an approach that does the call mangling and avoids any overhead with .SD

# a helper function
makeCall <- function(x,fun) bquote(.(fun)(.(x)))
# the columns you wish to sum (apply function to)
cols <- c('a','b')
new.cols <- paste0('s',cols)
# create named list of names
name.cols <- setNames(sapply(cols,as.name), new.cols)
# create the call
my_call <-  as.call(c(as.name(':='), lapply(name.cols, makeCall, fun = as.name('sum'))))
(a[, eval(my_call), by = 'id'])

#    id  a  b sa sb
# 1:  1 21 11 43 23
# 2:  1 22 12 43 23
# 3:  2 23 13 47 27
# 4:  2 24 14 47 27
# 5:  3 25 15 25 15

Upvotes: 0

eddi
eddi

Reputation: 49448

You can also do this:

a <- data.table(id=c(1,1,2,2,3),a=21:25,b=11:15,key="id")

a[, c("sa", "sb") := lapply(.SD, sum), by = id]

Or slightly more generally:

cols.to.sum = c("a", "b")
a[, paste0("s", cols.to.sum) := lapply(.SD, sum), by = id, .SDcols = cols.to.sum]

Upvotes: 11

Matt Dowle
Matt Dowle

Reputation: 59592

This is similar to :

How to generate a linear combination of variables and update table using data.table in a loop call?

but you want to combine this with by= too, so set() isn't flexible enough. That's a deliberate design design and set() is unlikely to change in that regard.

I sometimes use the EVAL helper at the end of that answer.
https://stackoverflow.com/a/20808573/403310
Some wince at that approach but I just think of it like constructing a dynamic SQL statement, which is quite common practice. The EVAL approach gives ultimate flexibility without head scratching about eval() and quote(). To see the dynamic query that's been constructed (to check it) you can add a print inside your EVAL helper function.

However, in this simple example you can wrap the LHS of := with brackets to tell data.table to lookup the value (clearer than with=FALSE), and the RHS needs a get().

for (n in c("a","b")) {
  s <- paste0("s",n)
  a[, (s) := sum(get(n)), by="id"]
}

Upvotes: 8

Related Questions