Aggregating a column by the same column

Question

Why the conventional syntax for column aggregation in data.table dt[, sum(x), by = "y"] doesn't work if we refer to the same column in j and by, i.e., dt[, sum(x), by = "x"]?

library(data.table)

set.seed(1)
dt <- data.table(x = sample(c(1:10), 20, T), y = sample(letters[1:4], 20, T))
setorderv(dt, "y")

I want to sum x by x but the following doesn't work, it merely replicates the x column:

> dt[, sum(x, na.rm = T), by = "x"]
     x V1
 1:  4  4
 2: 10 10
 3:  3  3
 4:  9  9
 5:  7  7
 6:  1  1
 7:  8  8
 8:  6  6
 9:  2  2
10:  5  5

Now if I do:

> dt[, .(res = lapply(.SD, sum, na.rm = T)), by = 'x', .SDcols = "x"] 
     x res
 1:  4  12
 2: 10  30
 3:  3   9
 4:  9   9
 5:  7  21
 6:  1   1
 7:  8  24
 8:  6   6
 9:  2   2
10:  5   5

That works.

On the other hand, the following aggregation works (as expected) if the by argument is a column different than the one used to aggregate in j:

> dt[, sum(x, na.rm = T), by = "y"]
   y V1
1: a 38
2: b 38
3: c 17
4: d 26

Frank · Accepted Answer

Every column in by= is reduced to a length of one, instead of redundantly duplicating the values out to the length of the group, .N, taking up more memory. You can still calculate the sum by multiplying by .N, though:

dt[, x*.N, by=x]

... or the full-length vector can be added to .SD (though this should never be necessary):

dt[, sum(.SD$x), by=x, .SDcols="x"]

.SD is the "Subset of Data" for a given by= group. See ?.SD or ?.N for details.

Aggregating a column by the same column

Answers (1)

Related Questions