Reputation: 283
I would like to improve my piece of code. Let's say you want to remove duplicate rows that have the same 'label' and 'id'. The way I do it is:
library(data.table)
dt <- data.table(label = c("A", "A", "B", "B", "C", "A", "A", "A"),
id = c(1, 1, 2, 2, 3, 4, 5, 5))
tmp = dt[label == 'A',]
tmp = unique(tmp, by = 'id')
dt = dt[label != 'A',]
dt = rbind(dt, tmp)
Is there a smarter/shorter way to accomplish that? If possible by reference? This code looks very ugly and implies a lot of copies.
(Moreover I have to do this operation for a few labels, but not all of them. So this implies 4 lines for every label...)
Thanks !
Example:
label id
A 1
A 1
B 2
B 2
C 3
A 4
A 5
A 5
Would give :
label id
A 1
B 2
B 2
C 3
A 4
A 5
Note that line 3 and 4 stay duplicated since the label is equal to 'B' and not to 'A'.
Upvotes: 0
Views: 240
Reputation: 1149
There is no need to create tmp
and then rbind
it again. You can simply use the duplicated
function as follows:
dt[label != "A" | !duplicated(dt, by=c("label", "id"))]
# label id
# 1: A 1
# 2: B 2
# 3: B 2
# 4: C 3
# 5: A 4
# 6: A 5
If you want to do this over several labels:
dt[!label %in% c("A", "C") | !duplicated(dt, by=c("label", "id"))]
See ?duplicated
to learn more about de-duplication functions in data.table
.
Upvotes: 3
Reputation: 887611
This could be also done using an if/else
condition
dt[, if(all(label=='A')) .SD[1L] else .SD, by = id]
# id label
#1: 1 A
#2: 2 B
#3: 2 B
#4: 3 C
#5: 4 A
#6: 5 A
Upvotes: 1