ruser
ruser

Reputation: 1599

Determine whether column values are unique in data.table

I a using a data.table to store data. I am trying to figure out whether certain columns in each row are unique. I want to add a column to the data.table that will hold the value "Duplicated Values" if there are duplicated values and be NA if there are no duplicated values. The names of the columns that I want to check for duplication are stored in a character vector. For example, I create my data.table:

tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
> tmpdt
   a b c d
1: 1 2 4 3
2: 2 2 2 3
3: 3 3 2 1
4: 4 4 4 4
5: 5 5 4 5

I have another variable that indicates which columns I need to check for duplicates. It is important that I be able to store the column names in a character vector and not need to "know" them (because they will be passed as an argument to a function).

dupcheckcols<-c("a", "c", "d")

I want the output to be:

> tmpdt
   a b c d     Dups
1: 1 2 4 3     <NA>
2: 2 2 2 3 Has Dups
3: 3 3 2 1     <NA>
4: 4 4 4 4 Has Dups
5: 5 5 4 5 Has Dups

If I were using a data.frame, this is easy. I could simply use:

tmpdt<-data.frame(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
tmpdt$Dups<-NA
tmpdt$Dups[apply(tmpdt[,dupcheckcols], 1, function(x) {return(sum(duplicated(x))>0)})]<-"Has Dups"
> tmpdt
  a b c d     Dups
1 1 2 4 3     <NA>
2 2 2 2 3 Has Dups
3 3 3 2 1     <NA>
4 4 4 4 4 Has Dups
5 5 5 4 5 Has Dups

But I can't figure out how to accomplish the same task with a data.table. Any help is greatly appreciated.

Upvotes: 4

Views: 244

Answers (5)

Frank
Frank

Reputation: 66819

I found a way to do this with Rcpp, following an example by hadley (under "Sets"):

// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalVector anyDupCols(IntegerMatrix x) {
    int nr = x.nrow();
    int nc = x.ncol();
    LogicalVector out(nr, false);

    std::unordered_set<int> seen;
    for (int i = 0; i < nr; i++) {
        seen.clear();
        for (int j = 0; j < nc; j++){
            int xij = x(i,j);
            if (seen.count(xij)){ out[i] = true; break; }
            else seen.insert(xij);
        }
    }

    return out;
}

To use it, put it in a cpp file and run

library(Rcpp)
sourceCpp("anyDupCols.cpp")
anyDupCols(as.matrix(DT))

It does pretty well in benchmarks:

nc = 30
nv = nc^2
n  = 1e4

set.seed(1)
DT = setDT( replicate(nc, sample(nv, n, replace = TRUE), simplify=FALSE) )

library(microbenchmark)
microbenchmark(
    ananda = DT[, any(duplicated(unlist(.SD, use.names = FALSE))), by = 1:nrow(DT)]$V1,
    tospig = {
        expr = parse(text=paste(apply(t(combn(names(DT),2)),1,FUN = 
          function(x){ paste0(x, collapse="==") }), collapse = "|"))
        DT[, eval(expr)]
    },
    cpp = anyDupCols(as.matrix(DT)),
    alex = ff(DT),
    tscharf = apply(DT,1,function(row) any(duplicated(row))),
    unit = "relative", times = 10
)

Unit: relative
    expr      min       lq     mean   median       uq      max neval  cld
  ananda 2.462739 2.596990 2.774660 2.659898 2.869048 3.352547    10   c 
  tospig 3.118158 3.253102 3.606263 3.424598 3.885561 4.583268    10    d
     cpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10 a   
    alex 1.295415 1.927802 1.914883 1.982580 2.029868 2.538143    10  b  
 tscharf 2.112286 2.204654 2.385318 2.234963 2.322206 2.978047    10  bc 

If I go to nc = 50, @tospig's expr becomes too long for R to handle and I get node stack overflow, which is fun.

Upvotes: 3

alexis_laz
alexis_laz

Reputation: 13122

Another way is to tabulate "tmpdt" along its rows and find which rows have more than one of an element:

tmpdt2 = tmpdt[, dupcheckcols, with = FALSE] # subset tmpdt
colSums(table(unlist(tmpdt2), row(tmpdt2)) > 1L) > 0L
#    1     2     3     4     5 
#FALSE  TRUE FALSE  TRUE  TRUE 

Peeking at table we could speed it up significantly with something like:

ff = function(x)
{
    lvs = Reduce(union, lapply(x, function(X) if(is.factor(X)) levels(X) else unique(X)))
    x = lapply(x, function(X) match(X, lvs))
    nr = length(lvs); nc = length(x[[1L]])
    tabs = "dim<-"(tabulate(unlist(x, use.names = FALSE) + (0:(nc - 1L)) * nr, nr * nc), 
                   c(nr, nc))
    colSums(tabs > 1L) > 0L
}
ff(tmpdt2)
#[1] FALSE  TRUE FALSE  TRUE  TRUE

Upvotes: 1

T. Scharf
T. Scharf

Reputation: 4844

a one-liner with some elegance

  1. define the columns

  2. loop down the rows

  3. see if there are any dupes

tmpdt[,dups:=apply(.SD,1,function(row) any(duplicated(row))),.SDcols = dupcheckcols]

> tmpdt
   a b c d  dups
1: 1 2 4 3 FALSE
2: 2 2 2 3  TRUE
3: 3 3 2 1 FALSE
4: 4 4 4 4  TRUE
5: 5 5 4 5  TRUE

Upvotes: 1

tospig
tospig

Reputation: 8343

I'm sure there are other ways

tmpdt[, dups := tmpdt[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ]
#   a b c d  dups
#1: 1 2 4 3 FALSE
#2: 2 2 2 3  TRUE
#3: 3 3 2 1 FALSE
#4: 4 4 4 4  TRUE
#5: 5 5 4 5  TRUE

A more convoluted, but slightly quicker (in computational terms) method would be to construct the filter condition in i, then update in j by reference

expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
# [1] "a==c|a==d|c==d"

expr <- parse(text=expr)
tmpdt[ eval(expr), dups := TRUE ]
#   a b c d dups
#1: 1 2 4 3   NA
#2: 2 2 2 3 TRUE
#3: 3 3 2 1   NA
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE

I was interested in speed benefits, so I've benchmarked these two plus Ananda's solution:

library(microbenchmark)

tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
t1 <- tmpdt
t2 <- tmpdt
t3 <- tmpdt

expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
expr <- parse(text=expr)

microbenchmark(
#Ananda's solution
t1[, dups := any(duplicated(unlist(.SD))), by = 1:nrow(tmpdt), .SDcols = dupcheckcols],

t2[, dups := t2[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ],

t3[ eval(expr), dups := TRUE ]
)
 #     min        lq      mean   median        uq      max neval cld
 # 531.416  552.5760  577.0345  565.182  573.2015 1761.863   100  b 
 #1277.569 1333.2615 1389.5857 1358.021 1387.9860 2694.951   100   c
 # 265.872  283.3525  293.9362  292.487  301.1640  520.436   100 a  

Upvotes: 5

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

You should be able to do something like this:

tmpdt[, dups := any(duplicated(unlist(.SD, use.names = FALSE))), 
      by = 1:nrow(tmpdt), .SDcols = dupcheckcols]
tmpdt
#    a b c d  dups
# 1: 1 2 4 3 FALSE
# 2: 2 2 2 3  TRUE
# 3: 3 3 2 1 FALSE
# 4: 4 4 4 4  TRUE
# 5: 5 5 4 5  TRUE

Adjust accordingly if you really want the words "Has Dups", but note that it would probably be easier to use logical values, as in my answer here.

Upvotes: 3

Related Questions