Reputation: 1599
I a using a data.table to store data. I am trying to figure out whether certain columns in each row are unique. I want to add a column to the data.table that will hold the value "Duplicated Values" if there are duplicated values and be NA if there are no duplicated values. The names of the columns that I want to check for duplication are stored in a character vector. For example, I create my data.table:
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
> tmpdt
a b c d
1: 1 2 4 3
2: 2 2 2 3
3: 3 3 2 1
4: 4 4 4 4
5: 5 5 4 5
I have another variable that indicates which columns I need to check for duplicates. It is important that I be able to store the column names in a character vector and not need to "know" them (because they will be passed as an argument to a function).
dupcheckcols<-c("a", "c", "d")
I want the output to be:
> tmpdt
a b c d Dups
1: 1 2 4 3 <NA>
2: 2 2 2 3 Has Dups
3: 3 3 2 1 <NA>
4: 4 4 4 4 Has Dups
5: 5 5 4 5 Has Dups
If I were using a data.frame, this is easy. I could simply use:
tmpdt<-data.frame(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
tmpdt$Dups<-NA
tmpdt$Dups[apply(tmpdt[,dupcheckcols], 1, function(x) {return(sum(duplicated(x))>0)})]<-"Has Dups"
> tmpdt
a b c d Dups
1 1 2 4 3 <NA>
2 2 2 2 3 Has Dups
3 3 3 2 1 <NA>
4 4 4 4 4 Has Dups
5 5 5 4 5 Has Dups
But I can't figure out how to accomplish the same task with a data.table. Any help is greatly appreciated.
Upvotes: 4
Views: 244
Reputation: 66819
I found a way to do this with Rcpp, following an example by hadley (under "Sets"):
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector anyDupCols(IntegerMatrix x) {
int nr = x.nrow();
int nc = x.ncol();
LogicalVector out(nr, false);
std::unordered_set<int> seen;
for (int i = 0; i < nr; i++) {
seen.clear();
for (int j = 0; j < nc; j++){
int xij = x(i,j);
if (seen.count(xij)){ out[i] = true; break; }
else seen.insert(xij);
}
}
return out;
}
To use it, put it in a cpp file and run
library(Rcpp)
sourceCpp("anyDupCols.cpp")
anyDupCols(as.matrix(DT))
It does pretty well in benchmarks:
nc = 30
nv = nc^2
n = 1e4
set.seed(1)
DT = setDT( replicate(nc, sample(nv, n, replace = TRUE), simplify=FALSE) )
library(microbenchmark)
microbenchmark(
ananda = DT[, any(duplicated(unlist(.SD, use.names = FALSE))), by = 1:nrow(DT)]$V1,
tospig = {
expr = parse(text=paste(apply(t(combn(names(DT),2)),1,FUN =
function(x){ paste0(x, collapse="==") }), collapse = "|"))
DT[, eval(expr)]
},
cpp = anyDupCols(as.matrix(DT)),
alex = ff(DT),
tscharf = apply(DT,1,function(row) any(duplicated(row))),
unit = "relative", times = 10
)
Unit: relative
expr min lq mean median uq max neval cld
ananda 2.462739 2.596990 2.774660 2.659898 2.869048 3.352547 10 c
tospig 3.118158 3.253102 3.606263 3.424598 3.885561 4.583268 10 d
cpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
alex 1.295415 1.927802 1.914883 1.982580 2.029868 2.538143 10 b
tscharf 2.112286 2.204654 2.385318 2.234963 2.322206 2.978047 10 bc
If I go to nc = 50
, @tospig's expr
becomes too long for R to handle and I get node stack overflow
, which is fun.
Upvotes: 3
Reputation: 13122
Another way is to tabulate "tmpdt" along its rows and find which rows have more than one of an element:
tmpdt2 = tmpdt[, dupcheckcols, with = FALSE] # subset tmpdt
colSums(table(unlist(tmpdt2), row(tmpdt2)) > 1L) > 0L
# 1 2 3 4 5
#FALSE TRUE FALSE TRUE TRUE
Peeking at table
we could speed it up significantly with something like:
ff = function(x)
{
lvs = Reduce(union, lapply(x, function(X) if(is.factor(X)) levels(X) else unique(X)))
x = lapply(x, function(X) match(X, lvs))
nr = length(lvs); nc = length(x[[1L]])
tabs = "dim<-"(tabulate(unlist(x, use.names = FALSE) + (0:(nc - 1L)) * nr, nr * nc),
c(nr, nc))
colSums(tabs > 1L) > 0L
}
ff(tmpdt2)
#[1] FALSE TRUE FALSE TRUE TRUE
Upvotes: 1
Reputation: 4844
a one-liner with some elegance
define the columns
loop down the rows
see if there are any dupes
tmpdt[,dups:=apply(.SD,1,function(row) any(duplicated(row))),.SDcols = dupcheckcols]
> tmpdt
a b c d dups
1: 1 2 4 3 FALSE
2: 2 2 2 3 TRUE
3: 3 3 2 1 FALSE
4: 4 4 4 4 TRUE
5: 5 5 4 5 TRUE
Upvotes: 1
Reputation: 8343
I'm sure there are other ways
tmpdt[, dups := tmpdt[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ]
# a b c d dups
#1: 1 2 4 3 FALSE
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 FALSE
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
A more convoluted, but slightly quicker (in computational terms) method would be to construct the filter condition in i
, then update in j
by reference
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
# [1] "a==c|a==d|c==d"
expr <- parse(text=expr)
tmpdt[ eval(expr), dups := TRUE ]
# a b c d dups
#1: 1 2 4 3 NA
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 NA
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
I was interested in speed benefits, so I've benchmarked these two plus Ananda's solution:
library(microbenchmark)
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
t1 <- tmpdt
t2 <- tmpdt
t3 <- tmpdt
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
expr <- parse(text=expr)
microbenchmark(
#Ananda's solution
t1[, dups := any(duplicated(unlist(.SD))), by = 1:nrow(tmpdt), .SDcols = dupcheckcols],
t2[, dups := t2[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ],
t3[ eval(expr), dups := TRUE ]
)
# min lq mean median uq max neval cld
# 531.416 552.5760 577.0345 565.182 573.2015 1761.863 100 b
#1277.569 1333.2615 1389.5857 1358.021 1387.9860 2694.951 100 c
# 265.872 283.3525 293.9362 292.487 301.1640 520.436 100 a
Upvotes: 5
Reputation: 193687
You should be able to do something like this:
tmpdt[, dups := any(duplicated(unlist(.SD, use.names = FALSE))),
by = 1:nrow(tmpdt), .SDcols = dupcheckcols]
tmpdt
# a b c d dups
# 1: 1 2 4 3 FALSE
# 2: 2 2 2 3 TRUE
# 3: 3 3 2 1 FALSE
# 4: 4 4 4 4 TRUE
# 5: 5 5 4 5 TRUE
Adjust accordingly if you really want the words "Has Dups", but note that it would probably be easier to use logical values, as in my answer here.
Upvotes: 3