JdeMello
JdeMello

Reputation: 1718

Aggregating a column by the same column

Why the conventional syntax for column aggregation in data.table dt[, sum(x), by = "y"] doesn't work if we refer to the same column in j and by, i.e., dt[, sum(x), by = "x"]?

library(data.table)

set.seed(1)
dt <- data.table(x = sample(c(1:10), 20, T), y = sample(letters[1:4], 20, T))
setorderv(dt, "y")

I want to sum x by x but the following doesn't work, it merely replicates the x column:

> dt[, sum(x, na.rm = T), by = "x"]
     x V1
 1:  4  4
 2: 10 10
 3:  3  3
 4:  9  9
 5:  7  7
 6:  1  1
 7:  8  8
 8:  6  6
 9:  2  2
10:  5  5

Now if I do:

> dt[, .(res = lapply(.SD, sum, na.rm = T)), by = 'x', .SDcols = "x"] 
     x res
 1:  4  12
 2: 10  30
 3:  3   9
 4:  9   9
 5:  7  21
 6:  1   1
 7:  8  24
 8:  6   6
 9:  2   2
10:  5   5

That works.

On the other hand, the following aggregation works (as expected) if the by argument is a column different than the one used to aggregate in j:

> dt[, sum(x, na.rm = T), by = "y"]
   y V1
1: a 38
2: b 38
3: c 17
4: d 26

Upvotes: 1

Views: 38

Answers (1)

Frank
Frank

Reputation: 66819

Every column in by= is reduced to a length of one, instead of redundantly duplicating the values out to the length of the group, .N, taking up more memory. You can still calculate the sum by multiplying by .N, though:

dt[, x*.N, by=x]

... or the full-length vector can be added to .SD (though this should never be necessary):

dt[, sum(.SD$x), by=x, .SDcols="x"]

.SD is the "Subset of Data" for a given by= group. See ?.SD or ?.N for details.

Upvotes: 2

Related Questions