Reputation: 1127
I aggregated a data.table by a column, and set that as a key, then was surprised to find that the table still contained duplicated rows. What is the reason for this?
My table was special in that I had two columns with exactly the same values (but had to keep both for a practical reason), and I aggregated the table by one of those.
A simple example:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2, sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Error: !any(duplicated(aggr)) is not TRUE
Upvotes: 1
Views: 228
Reputation: 1127
If you use an aggregation function for all columns, then you get the expected result, without duplicated rows:
> library(data.table)
> dat = data.table(
+ class1 = c('a', 'a', 'b'),
+ class2 = c('a', 'a', 'b'),
+ value = 1:3)
> aggr = dat[, list(class2[[1]], sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Note that the difference is that I take the first element of the class2
column. Note that any other function that outputs one value works as well.
Upvotes: 1