Reputation: 1718
Why the conventional syntax for column aggregation in data.table
dt[, sum(x), by = "y"]
doesn't work if we refer to the same column in j
and by
, i.e., dt[, sum(x), by = "x"]
?
library(data.table)
set.seed(1)
dt <- data.table(x = sample(c(1:10), 20, T), y = sample(letters[1:4], 20, T))
setorderv(dt, "y")
I want to sum x
by x
but the following doesn't work, it merely replicates the x
column:
> dt[, sum(x, na.rm = T), by = "x"]
x V1
1: 4 4
2: 10 10
3: 3 3
4: 9 9
5: 7 7
6: 1 1
7: 8 8
8: 6 6
9: 2 2
10: 5 5
Now if I do:
> dt[, .(res = lapply(.SD, sum, na.rm = T)), by = 'x', .SDcols = "x"]
x res
1: 4 12
2: 10 30
3: 3 9
4: 9 9
5: 7 21
6: 1 1
7: 8 24
8: 6 6
9: 2 2
10: 5 5
That works.
On the other hand, the following aggregation works (as expected) if the by
argument is a column different than the one used to aggregate in j
:
> dt[, sum(x, na.rm = T), by = "y"]
y V1
1: a 38
2: b 38
3: c 17
4: d 26
Upvotes: 1
Views: 38
Reputation: 66819
Every column in by=
is reduced to a length of one, instead of redundantly duplicating the values out to the length of the group, .N
, taking up more memory. You can still calculate the sum by multiplying by .N
, though:
dt[, x*.N, by=x]
... or the full-length vector can be added to .SD
(though this should never be necessary):
dt[, sum(.SD$x), by=x, .SDcols="x"]
.SD
is the "Subset of Data" for a given by=
group. See ?.SD
or ?.N
for details.
Upvotes: 2