Reputation: 770
I have a data table like this
dt <- data.table(date=c('d1','d2','d3','d1','d2','d3'),v1=c('a','a','b','a','b','b'),v2=c(2,2,4,2,4,4))
date v1 v2
1: d1 a 2
2: d2 a 2 <-need to remove this
3: d3 b 4
4: d1 a 2
5: d2 b 4 <-need to remove this
6: d3 b 4
My actual data contains 16million rows and 5 columns which make up the unique condition, and one date column. I want to remove duplicates that have the same common values(in v1,v2) but only when their dates(date) are different.
Sample output
date v1 v2
1: d1 a 2
2: d3 b 4
3: d1 a 2
4: d3 b 4
I tried "duplicated" function but unable to find the right method to remove duplicates. Appreciate any help.
Upvotes: 3
Views: 159
Reputation: 24074
If I "translate" correctly, you need either the rows that are not duplicated for variables v1
and v2
or the rows that are duplicated for those variables but also for variable date
dt[!duplicated(dt[, .(v1, v2)]) |
(duplicated(dt[, .(v1, v2)]) & duplicated(dt[, .(date, v1, v2)]))]
# date v1 v2
#1: d1 a 2
#2: d3 b 4
#3: d1 a 2
#4: d3 b 4
As mentionned by @Arun, another preferable way, to avoid making a copy of dt
, is to take advantage of by
parameter of duplicated.data.table
:
dt[!duplicated(dt, by=c("v1", "v2")) |
(duplicated(dt, by=c("v1", "v2")) & duplicated(dt, by=c("date", "v1", "v2")))]
Upvotes: 8
Reputation: 83215
The way I understood the question is that for each v1
/v2
combination only the dates with the highest number of occurences need to be preserved. To achieve that you could do:
dt[, n:=.N, by = .(date,v1,v2)][, .SD[n==max(n)], by = .(v1,v2)][, n:= NULL][]
which gives:
v1 v2 date
1: a 2 d1
2: a 2 d1
3: b 4 d3
4: b 4 d3
With the provided example data this gives the same result as @Cath and @Akrun.
However, with the following data (posted by @Cath in the comments of the question):
dt1 <- data.table(date=c('d1','d2','d3','d1','d2','d3','d4','d4'),
v1=c('a','a','b','a','b','b','a','b'),
v2=c(2,2,4,2,4,4,1,2))
Now @akrun's method will give a different result:
> dt1[!duplicated(dt1[, .(v1, v2)]) | (duplicated(dt1[, .(v1, v2)]) & duplicated(dt1[, .(date, v1, v2)])), ]
date v1 v2
1: d1 a 2
2: d3 b 4
3: d1 a 2
4: d3 b 4
5: d4 a 1
6: d4 b 2
> dt1[, if(uniqueN(.SD, by = c('v1', 'v2'))==1) .SD ,by = date]
date v1 v2
1: d1 a 2
2: d1 a 2
3: d3 b 4
4: d3 b 4
> dt1[, n:=.N, by = .(date,v1,v2)][, .SD[n==max(n)], by = .(v1,v2)][, n:= NULL][]
v1 v2 date
1: a 2 d1
2: a 2 d1
3: b 4 d3
4: b 4 d3
5: a 1 d4
6: b 2 d4
Some clearer direction about what OP wants are therefore welcome.
Upvotes: 4