Reputation: 397
I have a dataframe, df as follows which consists of duplicates in some columns (id and serial):
df <- data.frame(id = c('B19','B19','B19','B20','B20','B20'),
serial = c('a','a','a','b','b','b'),
code = c('IA284','IA114','IA114','IB290','IB485','IB290'))
df
# id serial code
# B19 a IA284
# B19 a IA114
# B19 a IA114
# B20 b IB290
# B20 b IB485
# B20 b IB290
I could identify duplicated id and serial, with the following code:
df$comment[duplicated(df[,1:2])] <- "DUPLICATED"
But I would like to calculate which rows have repeated (duplicated) pair of id and serial and for each set of duplicate found, I then need to compare the code for it. In a separate column, I would like to print the set of duplicates found and the difference in code if any.
Example resulting df:
# id serial code Duplicate_Set comment
# B19 a IA284 1 1
# B19 a IA114 1 2
# B19 a IA114 1 2
# B20 b IB290 2 2
# B20 b IB485 2 1
# B20 b IB290 2 2
Explanation: in the "Duplicate_Set", I would like to group the pair of repeating "id" and "serial" and they are assigned a number. In the "comment" column, I would like the code values to be compared for each duplicated set and print the number of repeated codes for each set. It is just an example df. If someone could help me print "unique" and "duplicated" instead of 1 and 2 in the "comment" column, then it will be great too.
Upvotes: 0
Views: 104
Reputation: 887501
We can also use data.table
library(data.table)
setDT(df)[, Duplicate_Set := rleid(id, serial)][,
comment := .N, .(Duplicate_Set, code)][]
# id serial code Duplicate_Set comment
#1: B19 a IA284 1 1
#2: B19 a IA114 1 2
#3: B19 a IA114 1 2
#4: B20 b IB290 2 2
#5: B20 b IB485 2 1
#6: B20 b IB290 2 2
Upvotes: 0
Reputation: 389115
Using dplyr
, we can first create a unique number for each pair of id
and Serial
, then group_by
each Duplicate_Set
and code
count the number of rows in the group.
library(dplyr)
df %>%
mutate(Duplicate_Set = group_indices(., id, serial)) %>%
group_by(Duplicate_Set, code) %>%
mutate(comment = n())
# id serial code Duplicate_Set comment
# <fct> <fct> <fct> <int> <int>
#1 B19 a IA284 1 1
#2 B19 a IA114 1 2
#3 B19 a IA114 1 2
#4 B20 b IB290 2 2
#5 B20 b IB485 2 1
#6 B20 b IB290 2 2
A shorter version suggested by @tmfmnk
df %>%
group_by(Duplicate_Set = group_indices(., id, serial)) %>%
add_count(code, name = "comment")
Upvotes: 1