data.table : remove duplicate subset of rows for a given index value

Question

I would like to improve my piece of code. Let's say you want to remove duplicate rows that have the same 'label' and 'id'. The way I do it is:

library(data.table)
dt <- data.table(label = c("A", "A", "B", "B", "C", "A", "A", "A"),
                 id = c(1, 1, 2, 2, 3, 4, 5, 5))
tmp = dt[label == 'A',]
tmp = unique(tmp, by = 'id')
dt = dt[label != 'A',]
dt = rbind(dt, tmp)

Is there a smarter/shorter way to accomplish that? If possible by reference? This code looks very ugly and implies a lot of copies.

(Moreover I have to do this operation for a few labels, but not all of them. So this implies 4 lines for every label...)

Thanks !

Example:

label id
A     1
A     1
B     2
B     2
C     3
A     4
A     5
A     5

Would give :

label id
A     1
B     2
B     2
C     3
A     4
A     5

Note that line 3 and 4 stay duplicated since the label is equal to 'B' and not to 'A'.

cocquemas · Accepted Answer

There is no need to create tmp and then rbind it again. You can simply use the duplicated function as follows:

dt[label != "A" | !duplicated(dt, by=c("label", "id"))]
#    label id
# 1:     A  1
# 2:     B  2
# 3:     B  2
# 4:     C  3
# 5:     A  4
# 6:     A  5

If you want to do this over several labels:

dt[!label %in% c("A", "C") | !duplicated(dt, by=c("label", "id"))]

See ?duplicated to learn more about de-duplication functions in data.table.

data.table : remove duplicate subset of rows for a given index value

Answers (2)

Related Questions