While merging two datasets I checked the data for duplicate entries with the function duplicated. I get different output depending on whether I run duplicated before or after setkey(). Is this the expected behaviour in data.table? In my (humble) opinion the number of duplicates should not change when the key is set, since, as I understand it, setting the key only reorders and indexes the data.table. Am I missing some crucial point?
Thanks a lot!
Here is an example data.table:
> DT
   id x1 x2
1:  A  0  1
2:  A  1  1
3:  B  0  1
4:  B  1  0
5:  C  1  1
6:  C  0  0
Running duplicated on this unkeyed dataset reports no duplicate entries, which seems correct.
duplicated(DT)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
Then, after setting the key with setkey(), I get the following output,
setkey(DT,id)
duplicated(DT)
[1] FALSE TRUE FALSE TRUE FALSE TRUE
where the function indicates 3 duplicates. I just don't get it...
Here is the code I used to generate the data.table:
library(data.table)
set.seed(123)
id <- rep(LETTERS[1:3], each = 2)
x1 <- sample(c(0, 1), 6, replace = TRUE)
x2 <- sample(c(0, 1), 6, replace = TRUE)
DT <- data.table(id, x1, x2)
Upvotes: 1
To make duplicated use every column in the row-wise duplicate check, even after one or more key columns have been set, use
duplicated(DT,by=NULL)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
As the documentation states, duplicated for data.table behaves differently from the base version, that is, from how duplicated handles data.frames.
When you set the key with setkey(), duplicated only checks the key columns for duplicates. In the question only id is set as the key, so only the values in the id column are checked for duplicates.
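As a small illustration of the key-only check, the sketch below passes the key columns to by explicitly, which should give the same result as the keyed output shown in the question. The table is typed in literally from the question's printout instead of being regenerated from the seed, and the name dup_on_key is only illustrative. Note that in data.table 1.9.8 and later the default for by was changed to all columns, so the key-based default described here applies to older versions. Passing by explicitly avoids depending on either default.

```r
library(data.table)

# Rows copied from the example table in the question
DT <- data.table(
  id = rep(LETTERS[1:3], each = 2),
  x1 = c(0, 1, 0, 1, 1, 0),
  x2 = c(1, 1, 1, 0, 1, 0)
)
setkey(DT, id)

key(DT)  # "id"

# Compare rows on the key column(s) only
dup_on_key <- duplicated(DT, by = key(DT))
dup_on_key  # FALSE TRUE FALSE TRUE FALSE TRUE
```

Each id appears twice, so the second row of every pair is flagged even though the x1 and x2 values differ.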
If you specify the by argument of duplicated, the function compares rows using only the columns named in by, and flags every row that repeats the values of an earlier row in those columns.
By setting by=NULL, all columns are considered, so the function checks for row-wise duplicates across every column. This mimics the behaviour of duplicated when handling data.frames.
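To check the claim about data.frames, here is a minimal sketch that compares the all-column check on the keyed data.table with base duplicated on a data.frame copy. It names every column in by via names(DT) instead of using by=NULL, on the assumption that the two are equivalent here. The variable names dup_all and dup_df are only illustrative.

```r
library(data.table)

# Rows copied from the example table in the question
DT <- data.table(
  id = rep(LETTERS[1:3], each = 2),
  x1 = c(0, 1, 0, 1, 1, 0),
  x2 = c(1, 1, 1, 0, 1, 0)
)
setkey(DT, id)

# Row-wise check over every column of the keyed data.table
dup_all <- duplicated(DT, by = names(DT))

# Base R method on a plain data.frame copy
dup_df <- duplicated(as.data.frame(DT))

dup_all                 # FALSE FALSE FALSE FALSE FALSE FALSE
all(dup_all == dup_df)  # TRUE
```

No two rows agree in all three columns, so neither method reports a duplicate.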
Upvotes: 2