Reputation: 8646
if I understand correctly, duplicated()
function for data.table
returns a logical vector which doesn't contain first occurrence of duplicated record. What is the best way to mark this first occurrence as well? In case of base::duplicated()
, I solved this simply by disjunction with reversed order function: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE))
- but in data.table::duplicated()
, fromLast=TRUE
is not included (I don't know why)...
P.S. ok, here's a primitive example
myDT <- fread(
"id,fB,fC
1, b1,c1
2, b2,c2
3, b1,c1
4, b3,c3
5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]
rows 1, 3 and 5 are all duplicates but only 3 and 5 will be included in duplicated
while I need to mark all of them.
UPD. important notice: the answer I've accepted below works only for keyed table. If you want to find duplicate records considering all columns, you have to setkey
all these columns explicitly. So far I use the following workaround specifically for this case:
dups1 <- duplicated(myDT);
dups2 <- duplicated(myDT, fromLast=T);
dups <- dups1 | dups2;
Upvotes: 41
Views: 17315
Reputation: 49448
Many years ago this was the fastest answer by a large margin (see revision history if interested):
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
There have been a lot of internal changes since then however, that have made many options about the same order:
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkey(myDT, fB, fC)
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT, by = key(myDT)),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],
dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
times=10)
# expr min lq mean median uq max neval
# key 523.3568 567.5372 632.2379 578.1474 678.4399 886.8199 10
# unique 189.7692 196.0417 215.4985 210.5258 224.4306 290.2597 10
# dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648 10
# dup2 143.2756 153.3738 236.4034 161.2133 318.1504 419.4082 10
# dup3 144.1497 150.9244 193.3058 166.9541 178.0061 460.5448 10
Upvotes: 16
Reputation: 15738
As of data.table version 1.9.8, the solution by eddi needs to be modified to be:
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
since:
Changes in v1.9.8 (on CRAN 25 Nov 2016)
POTENTIALLY BREAKING CHANGES
By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
Upvotes: 7
Reputation: 115475
A third approach, (that appears more efficient for this small example)
You can explicitly call duplicated.data.frame
....
myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)]
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)])
## Unit: microseconds
## expr min lq median uq max neval
## key 556.608 575.9265 588.906 600.9795 27713.242 100
## unique 1112.913 1164.8310 1183.244 1216.9000 2263.557 100
## dup 420.173 436.3220 448.396 461.3750 699.986 100
If we expand the size of the sample data.table
, then the key
approach is the clear winner
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkeyv(myDT, c('fB', 'fC'))
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## key 355.9258 358.1764 360.7628 450.9218 500.8360 10
## unique 451.3794 458.0258 483.3655 519.3341 553.2515 10
## dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012 10
Upvotes: 7
Reputation: 15163
This appears to work:
> myDT[unique(myDT),fD:=.N>1]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
Thanks to @flodel, the better way to do it is this:
> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
The difference in efficiency is substantial:
> microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
expr min lq median uq max neval
key 679.874 715.700 735.0575 773.7595 1825.437 100
unique 1417.845 1485.913 1522.7475 1567.9065 24053.645 100
Especially for the max. What's going on there?
Upvotes: 14