Checking for conflicts in a set of duplicate rows of a dataframe

Question

I have a dataframe, df as follows which consists of duplicates in some columns (id and serial):

df <- data.frame(id = c('B19','B19','B19','B20','B20','B20'), 
                 serial = c('a','a','a','b','b','b'), 
                 code = c('IA284','IA114','IA114','IB290','IB485','IB290'))

df

#  id  serial  code
#  B19    a    IA284
#  B19    a    IA114
#  B19    a    IA114
#  B20    b    IB290
#  B20    b    IB485
#  B20    b    IB290

I could identify duplicated id and serial, with the following code:

df$comment[duplicated(df[,1:2])] <- "DUPLICATED"

But I would like to calculate which rows have repeated (duplicated) pair of id and serial and for each set of duplicate found, I then need to compare the code for it. In a separate column, I would like to print the set of duplicates found and the difference in code if any.

Example resulting df:

#  id  serial  code    Duplicate_Set    comment
#  B19    a    IA284       1               1
#  B19    a    IA114       1               2
#  B19    a    IA114       1               2
#  B20    b    IB290       2               2
#  B20    b    IB485       2               1
#  B20    b    IB290       2               2

Explanation: in the "Duplicate_Set", I would like to group the pair of repeating "id" and "serial" and they are assigned a number. In the "comment" column, I would like the code values to be compared for each duplicated set and print the number of repeated codes for each set. It is just an example df. If someone could help me print "unique" and "duplicated" instead of 1 and 2 in the "comment" column, then it will be great too.

Ronak Shah · Accepted Answer

Using dplyr, we can first create a unique number for each pair of id and Serial, then group_by each Duplicate_Set and code count the number of rows in the group.

library(dplyr)
df %>%
  mutate(Duplicate_Set = group_indices(., id, serial)) %>%
  group_by(Duplicate_Set, code) %>%
  mutate(comment = n())

#  id    serial code  Duplicate_Set comment
#                 
#1 B19   a      IA284             1       1
#2 B19   a      IA114             1       2
#3 B19   a      IA114             1       2
#4 B20   b      IB290             2       2
#5 B20   b      IB485             2       1
#6 B20   b      IB290             2       2

A shorter version suggested by @tmfmnk

df %>% 
  group_by(Duplicate_Set = group_indices(., id, serial)) %>% 
  add_count(code, name = "comment")

Checking for conflicts in a set of duplicate rows of a dataframe

Answers (2)

Related Questions