Removing duplicates based on common and different in values

Question

I have a data table like this:

library(data.table)
dt <- data.table(date = c('d1','d2','d3','d1','d2','d3'), v1 = c('a','a','b','a','b','b'), v2 = c(2,2,4,2,4,4))
   date v1 v2
1:   d1  a  2
2:   d2  a  2 <- need to remove this
3:   d3  b  4 
4:   d1  a  2
5:   d2  b  4 <- need to remove this
6:   d3  b  4

My actual data contains 16 million rows, with 5 columns that together make up the uniqueness condition and one date column. I want to remove rows that duplicate the values of an earlier row (here v1 and v2), but only when their dates (date) are different.

Sample output

   date v1 v2
1:   d1  a  2
2:   d3  b  4
3:   d1  a  2
4:   d3  b  4

I tried the "duplicated" function but was unable to find the right way to remove these duplicates. I'd appreciate any help.

Cath · Accepted Answer

If I "translate" correctly, you need either the rows that are not duplicated on variables v1 and v2, or the rows that are duplicated on those variables and also on variable date:

dt[!duplicated(dt[, .(v1, v2)]) | 
   (duplicated(dt[, .(v1, v2)]) & duplicated(dt[, .(date, v1, v2)]))]
#   date v1 v2
#1:   d1  a  2
#2:   d3  b  4
#3:   d1  a  2
#4:   d3  b  4
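
To see why rows 2 and 5 are the ones dropped, it may help to look at the two logical vectors separately. The sketch below rebuilds the sample table from the question; the names dup_vals, dup_all and keep are illustrative only and are not part of the original answer.

```r
library(data.table)

dt <- data.table(date = c('d1','d2','d3','d1','d2','d3'),
                 v1   = c('a','a','b','a','b','b'),
                 v2   = c(2,2,4,2,4,4))

# TRUE where the (v1, v2) pair already appeared in an earlier row
dup_vals <- duplicated(dt, by = c("v1", "v2"))
# TRUE where the (date, v1, v2) triple already appeared in an earlier row
dup_all  <- duplicated(dt, by = c("date", "v1", "v2"))

# Same condition as in the answer, built from the two vectors
keep <- !dup_vals | (dup_vals & dup_all)

which(!keep)  # positions of the rows that get dropped
dt[keep]      # the filtered table
```

Rows 2 and 5 repeat an earlier (v1, v2) pair but carry a date not yet seen for that pair, so dup_vals is TRUE and dup_all is FALSE for them, and they are filtered out.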

As mentioned by @Arun, a preferable way, which avoids making a copy of dt, is to take advantage of the by parameter of duplicated.data.table:

dt[!duplicated(dt, by=c("v1", "v2")) | 
   (duplicated(dt, by=c("v1", "v2")) & duplicated(dt, by=c("date", "v1", "v2")))]
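
Any row that repeats an earlier (date, v1, v2) triple necessarily repeats an earlier (v1, v2) pair as well, so the second half of the condition can be shortened to the three-column check alone. On a table with millions of rows it may also be worth computing each duplicated() vector once and reusing it. A minimal sketch on the sample data, with illustrative variable names that are not from the original answer:

```r
library(data.table)

dt <- data.table(date = c('d1','d2','d3','d1','d2','d3'),
                 v1   = c('a','a','b','a','b','b'),
                 v2   = c(2,2,4,2,4,4))

# Compute each logical vector once instead of calling duplicated() three times
dup_vals <- duplicated(dt, by = c("v1", "v2"))
dup_all  <- duplicated(dt, by = c("date", "v1", "v2"))

keep_full  <- !dup_vals | (dup_vals & dup_all)  # condition as written in the answer
keep_short <- !dup_vals | dup_all               # shortened form

identical(keep_full, keep_short)  # check that both forms select the same rows
dt[keep_short]
```

For the real data, the by vectors would presumably list the five key columns in place of v1 and v2.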
