Duffau

Reputation: 491

R: Output of duplicated changes after setkey() in data.table

While merging two datasets I checked the data for duplicate entries using the function duplicated. I get different output depending on whether I run duplicated before or after setkey(). Is this expected behaviour in data.table? In my (humble) opinion the number of duplicates should not change when the key is set, since, as I understand it, setting a key only reorders and indexes the data.table. Am I missing some crucial point?

Thanks a lot!

Here is an example data.table:

> DT
   id x1 x2
1:  A  0  1
2:  A  1  1
3:  B  0  1
4:  B  1  0
5:  C  1  1
6:  C  0  0

Running duplicated on this unkeyed dataset reports no duplicate entries, which seems correct.

duplicated(DT)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Then, after setting the key with setkey(), I get the following output,

setkey(DT,id)
duplicated(DT)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

where the function indicates 3 duplicates. I just don't get it...

Here is the code I used to generate the data.table:

library(data.table)

set.seed(123)
id <- rep(LETTERS[1:3], each = 2)
x1 <- sample(c(0, 1), 6, replace = TRUE)
x2 <- sample(c(0, 1), 6, replace = TRUE)
DT <- data.table(id, x1, x2)

Upvotes: 1

Views: 384

Answers (1)

Duffau

Reputation: 491

To make duplicated compare whole rows, using the elements of every column, even after one or more key columns have been set, use

duplicated(DT,by=NULL)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

As the documentation states, the data.table method of duplicated behaves differently from the base version that handles data.frames.

Once a key has been set with setkey(), duplicated only compares the values in the key columns. In the question only id is set as the key, so only the values of the id column are checked for duplicates, and every second row is flagged because each id occurs twice.

If you specify the by argument, duplicated builds each row from the columns named in by only, and flags a row as a duplicate when an identical row already appears above it in the table.
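A minimal sketch of the by argument, assuming the data.table package is installed. The table is typed in by hand from the printout in the question rather than regenerated with sample(), because the random draws may differ between R versions. Passing by = key(DT) explicitly should reproduce the keyed result from the question; as far as I know, data.table 1.9.8 and later default to all columns, so spelling out by is the version-independent way to get this behaviour. The names dup_key and dup_id_x2 are only illustrative.

```r
library(data.table)

# Table copied from the printout in the question
DT <- data.table(
  id = c("A", "A", "B", "B", "C", "C"),
  x1 = c(0, 1, 0, 1, 1, 0),
  x2 = c(1, 1, 1, 0, 1, 0)
)
setkey(DT, id)

# Compare rows on the key column only (here: id)
dup_key <- duplicated(DT, by = key(DT))
print(dup_key)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

# Compare rows on a chosen subset of columns
dup_id_x2 <- duplicated(DT, by = c("id", "x2"))
print(dup_id_x2)
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE
```

Only the second row is flagged in the last call, because (A, 1) is the only id/x2 combination that already appears higher up in the table.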

With by=NULL all columns are considered, so the function checks for row-wise duplicates where each row contains the elements of every column. This mimics the behaviour of duplicated when handling data.frames.
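The comparison with data.frames can be checked directly. The sketch below assumes the data.table package is installed and again types the table in by hand. It uses by = names(DT), which I am assuming is an explicit way of naming every column, instead of by = NULL, so that the example does not depend on how a particular data.table version treats NULL. The names dup_all_cols and dup_data_frame are only illustrative.

```r
library(data.table)

# Table copied from the printout in the question
DT <- data.table(
  id = c("A", "A", "B", "B", "C", "C"),
  x1 = c(0, 1, 0, 1, 1, 0),
  x2 = c(1, 1, 1, 0, 1, 0)
)
setkey(DT, id)

# Row-wise check over every column of the keyed data.table
dup_all_cols <- duplicated(DT, by = names(DT))

# Same check through the base data.frame method
dup_data_frame <- duplicated(as.data.frame(DT))

print(dup_all_cols)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
print(identical(dup_all_cols, dup_data_frame))
# [1] TRUE
```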

Upvotes: 2
