Vasily A

Reputation: 8646

find *all* duplicated records in data.table (not all-but-one)

If I understand correctly, the duplicated() function for data.table returns a logical vector that doesn't flag the first occurrence of a duplicated record. What is the best way to mark this first occurrence as well? With base::duplicated(), I solved this simply by a disjunction with the reversed-order version: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE)) - but in data.table::duplicated(), fromLast=TRUE is not accepted (I don't know why)...

P.S. OK, here's a minimal example:

myDT <- fread(
"id,fB,fC
 1, b1,c1
 2, b2,c2
 3, b1,c1
 4, b3,c3
 5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]

Rows 1, 3, and 5 are all duplicates of each other, but duplicated() flags only rows 3 and 5, while I need all three of them marked.
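For reference, a minimal sketch of the desired result, applying the base-R disjunction idiom mentioned above to the key columns only (assumes a data.table version where duplicated() accepts fromLast; the fD column name is just the one used in this question):

```r
library(data.table)

myDT <- fread(
"id,fB,fC
 1,b1,c1
 2,b2,c2
 3,b1,c1
 4,b3,c3
 5,b1,c1
")
setkeyv(myDT, c('fB', 'fC'))

# mark every row whose (fB, fC) pair occurs more than once,
# including the first occurrence of each group
myDT[, fD := duplicated(.SD) | duplicated(.SD, fromLast = TRUE),
     .SDcols = key(myDT)]
```

After setkeyv the table is sorted by (fB, fC), so the three (b1, c1) rows get fD = TRUE and the other two get FALSE.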

UPD. Important notice: the answer I accepted below works only for a keyed table. If you want to find duplicate records considering all columns, you have to setkey on all those columns explicitly. So far I use the following workaround specifically for that case:

dups1 <- duplicated(myDT)
dups2 <- duplicated(myDT, fromLast=TRUE)
dups <- dups1 | dups2

Upvotes: 41

Views: 17315

Answers (4)

eddi

Reputation: 49448

Many years ago this was the fastest answer by a large margin (see revision history if interested):

dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
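The second term here just shifts dups one position to the left: on a keyed table the rows of a duplicate group sit next to each other, so OR-ing each row's flag with the flag of the row below it marks the first occurrence of each group as well. A sketch of the same idea using data.table's own shift() (equivalent on a keyed table; the toy column name is made up for illustration):

```r
library(data.table)

dt <- data.table(fB = c("b1", "b1", "b1", "b2", "b3"), key = "fB")

# duplicated() on a keyed table flags all-but-first within each key group
dups <- duplicated(dt, by = key(dt))          # FALSE TRUE TRUE FALSE FALSE

# lead the vector by one: a row belongs to a duplicate group if it is
# flagged itself or if the (adjacent) next row is flagged
dt[, fD := dups | shift(dups, 1L, type = "lead", fill = FALSE)]
```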

There have been a lot of internal changes since then, however, which have brought many options to roughly the same order of magnitude:

library(data.table)
library(microbenchmark)

myDT <- data.table(id = sample(1e6),
                   fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                   fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
setkey(myDT, fB, fC)

microbenchmark(
  key = myDT[, fD := .N > 1, by = key(myDT)],
  unique = myDT[unique(myDT, by = key(myDT)), fD := .N > 1],
  dup = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast=TRUE),
             .SDcols = key(myDT)],
  dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
  dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
  times = 10)

#   expr       min        lq      mean    median        uq       max neval
#    key  523.3568  567.5372  632.2379  578.1474  678.4399  886.8199    10
# unique  189.7692  196.0417  215.4985  210.5258  224.4306  290.2597    10
#    dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648    10
#   dup2  143.2756  153.3738  236.4034  161.2133  318.1504  419.4082    10
#   dup3  144.1497  150.9244  193.3058  166.9541  178.0061  460.5448    10

Upvotes: 16

Alex

Reputation: 15738

As of data.table version 1.9.8, the solution by eddi needs to be modified to be:

dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]

since:

Changes in v1.9.8 (on CRAN 25 Nov 2016)

POTENTIALLY BREAKING CHANGES

By default all columns are now used by unique(), duplicated() and uniqueN() data.table methods, #1284 and #1841. To restore old behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass by=key(DT) for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
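A quick illustration of that behaviour change, using the question's example data (assumes data.table >= 1.9.8):

```r
library(data.table)

myDT <- data.table(id = c(1L, 3L, 5L, 2L, 4L),
                   fB = c("b1", "b1", "b1", "b2", "b3"),
                   fC = c("c1", "c1", "c1", "c2", "c3"))
setkeyv(myDT, c("fB", "fC"))

# since 1.9.8: all columns are used by default, so the unique id column
# makes every row distinct
duplicated(myDT)                  # all FALSE

# the pre-1.9.8 behaviour, now requested explicitly:
duplicated(myDT, by = key(myDT))  # FALSE TRUE TRUE FALSE FALSE
```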

Upvotes: 7

mnel

Reputation: 115475

A third approach (which appears more efficient for this small example):

You can explicitly call duplicated.data.frame...

myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
  .SDcols = key(myDT)]


microbenchmark(
  key = myDT[, fD := .N > 1, by = key(myDT)],
  unique = myDT[unique(myDT), fD := .N > 1],
  dup = myDT[, fD := duplicated.data.frame(.SD) | duplicated.data.frame(.SD, fromLast=TRUE),
             .SDcols = key(myDT)])
## Unit: microseconds
##    expr      min        lq   median        uq       max neval
##     key  556.608  575.9265  588.906  600.9795 27713.242   100
##  unique 1112.913 1164.8310 1183.244 1216.9000  2263.557   100
##     dup  420.173  436.3220  448.396  461.3750   699.986   100

If we expand the size of the sample data.table, then the key approach is the clear winner:

myDT <- data.table(id = sample(1e6),
                   fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                   fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
setkeyv(myDT, c('fB', 'fC'))

microbenchmark(
  key=myDT[, fD := .N > 1, by = key(myDT)],
  unique=myDT[unique(myDT),fD:=.N>1], 
  dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
   .SDcols = key(myDT)],times=10)
## Unit: milliseconds
##    expr       min        lq    median        uq       max neval
##     key  355.9258  358.1764  360.7628  450.9218  500.8360    10
##  unique  451.3794  458.0258  483.3655  519.3341  553.2515    10
##     dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012    10

Upvotes: 7

mrip

Reputation: 15163

This appears to work:

> myDT[unique(myDT),fD:=.N>1]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE

Thanks to @flodel, the better way to do it is this:

> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE

The difference in efficiency is substantial:

> microbenchmark(
    key=myDT[, fD := .N > 1, by = key(myDT)],
    unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
   expr      min       lq    median        uq       max neval
    key  679.874  715.700  735.0575  773.7595  1825.437   100
 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100

Especially for the max. What's going on there?

Upvotes: 14
