Why do I get duplicated data.table rows after aggregation?

Question

I aggregated a data.table by a column, and set that as a key, then was surprised to find that the table still contained duplicated rows. What is the reason for this?

My table was special in that I had two columns with exactly the same values (but had to keep both for a practical reason), and I aggregated the table by one of those.

A simple example:

> library(data.table)
> dat = data.table(
+   class1 = c('a', 'a', 'b'), 
+   class2 = c('a', 'a', 'b'), 
+   value = 1:3)
> aggr = dat[, list(class2, sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))
Error: !any(duplicated(aggr)) is not TRUE

jciloa · Accepted Answer

If you use an aggregation function for all columns, then you get the expected result, without duplicated rows:

> library(data.table)
> dat = data.table(
+   class1 = c('a', 'a', 'b'), 
+   class2 = c('a', 'a', 'b'), 
+   value = 1:3)
> aggr = dat[, list(class2[[1]], sum(value)), keyby = class1]
> stopifnot(!any(duplicated(aggr)))

Note that the difference is that I take the first element of the class2 column. Note that any other function that outputs one value works as well.

Why do I get duplicated data.table rows after aggregation?

Answers (1)

Related Questions