pauljeba
pauljeba

Reputation: 770

Removing duplicates based on common and different in values

I have a data table like this

dt <- data.table(date=c('d1','d2','d3','d1','d2','d3'),v1=c('a','a','b','a','b','b'),v2=c(2,2,4,2,4,4))
   date v1 v2
1:   d1  a  2
2:   d2  a  2 <-need to remove this 
3:   d3  b  4 
4:   d1  a  2
5:   d2  b  4 <-need to remove this 
6:   d3  b  4

My actual data contains 16million rows and 5 columns which make up the unique condition, and one date column. I want to remove duplicates that have the same common values(in v1,v2) but only when their dates(date) are different.

Sample output

   date v1 v2
1:   d1  a  2
2:   d3  b  4
3:   d1  a  2
4:   d3  b  4

I tried "duplicated" function but unable to find the right method to remove duplicates. Appreciate any help.

Upvotes: 3

Views: 159

Answers (2)

Cath
Cath

Reputation: 24074

If I "translate" correctly, you need either the rows that are not duplicated for variables v1 and v2 or the rows that are duplicated for those variables but also for variable date

dt[!duplicated(dt[, .(v1, v2)]) | 
   (duplicated(dt[, .(v1, v2)]) & duplicated(dt[, .(date, v1, v2)]))]
#   date v1 v2
#1:   d1  a  2
#2:   d3  b  4
#3:   d1  a  2
#4:   d3  b  4

As mentionned by @Arun, another preferable way, to avoid making a copy of dt, is to take advantage of by parameter of duplicated.data.table:

dt[!duplicated(dt, by=c("v1", "v2")) | 
   (duplicated(dt, by=c("v1", "v2")) & duplicated(dt, by=c("date", "v1", "v2")))]

Upvotes: 8

Jaap
Jaap

Reputation: 83215

The way I understood the question is that for each v1/v2 combination only the dates with the highest number of occurences need to be preserved. To achieve that you could do:

dt[, n:=.N, by = .(date,v1,v2)][, .SD[n==max(n)], by = .(v1,v2)][, n:= NULL][]

which gives:

   v1 v2 date
1:  a  2   d1
2:  a  2   d1
3:  b  4   d3
4:  b  4   d3

With the provided example data this gives the same result as @Cath and @Akrun.


However, with the following data (posted by @Cath in the comments of the question):

dt1 <- data.table(date=c('d1','d2','d3','d1','d2','d3','d4','d4'), 
                  v1=c('a','a','b','a','b','b','a','b'),
                  v2=c(2,2,4,2,4,4,1,2))

Now @akrun's method will give a different result:

> dt1[!duplicated(dt1[, .(v1, v2)]) | (duplicated(dt1[, .(v1, v2)]) & duplicated(dt1[, .(date, v1, v2)])), ]
   date v1 v2
1:   d1  a  2
2:   d3  b  4
3:   d1  a  2
4:   d3  b  4
5:   d4  a  1
6:   d4  b  2
> dt1[, if(uniqueN(.SD, by = c('v1', 'v2'))==1) .SD ,by = date]
   date v1 v2
1:   d1  a  2
2:   d1  a  2
3:   d3  b  4
4:   d3  b  4
> dt1[, n:=.N, by = .(date,v1,v2)][, .SD[n==max(n)], by = .(v1,v2)][, n:= NULL][]
   v1 v2 date
1:  a  2   d1
2:  a  2   d1
3:  b  4   d3
4:  b  4   d3
5:  a  1   d4
6:  b  2   d4

Some clearer direction about what OP wants are therefore welcome.

Upvotes: 4

Related Questions