Hanif
Hanif

Reputation: 397

Checking for conflicts in a set of duplicate rows of a dataframe

I have a dataframe, df as follows which consists of duplicates in some columns (id and serial):

df <- data.frame(id = c('B19','B19','B19','B20','B20','B20'), 
                 serial = c('a','a','a','b','b','b'), 
                 code = c('IA284','IA114','IA114','IB290','IB485','IB290'))

df

#  id  serial  code
#  B19    a    IA284
#  B19    a    IA114
#  B19    a    IA114
#  B20    b    IB290
#  B20    b    IB485
#  B20    b    IB290

I could identify duplicated id and serial, with the following code:

df$comment[duplicated(df[,1:2])] <- "DUPLICATED"

But I would like to calculate which rows have repeated (duplicated) pair of id and serial and for each set of duplicate found, I then need to compare the code for it. In a separate column, I would like to print the set of duplicates found and the difference in code if any.

Example resulting df:

#  id  serial  code    Duplicate_Set    comment
#  B19    a    IA284       1               1
#  B19    a    IA114       1               2
#  B19    a    IA114       1               2
#  B20    b    IB290       2               2
#  B20    b    IB485       2               1
#  B20    b    IB290       2               2

Explanation: in the "Duplicate_Set", I would like to group the pair of repeating "id" and "serial" and they are assigned a number. In the "comment" column, I would like the code values to be compared for each duplicated set and print the number of repeated codes for each set. It is just an example df. If someone could help me print "unique" and "duplicated" instead of 1 and 2 in the "comment" column, then it will be great too.

Upvotes: 0

Views: 104

Answers (2)

akrun
akrun

Reputation: 887501

We can also use data.table

library(data.table)
setDT(df)[, Duplicate_Set := rleid(id, serial)][, 
      comment := .N, .(Duplicate_Set, code)][]
#    id serial  code Duplicate_Set comment
#1: B19      a IA284             1       1
#2: B19      a IA114             1       2
#3: B19      a IA114             1       2
#4: B20      b IB290             2       2
#5: B20      b IB485             2       1
#6: B20      b IB290             2       2

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 389115

Using dplyr, we can first create a unique number for each pair of id and Serial, then group_by each Duplicate_Set and code count the number of rows in the group.

library(dplyr)
df %>%
  mutate(Duplicate_Set = group_indices(., id, serial)) %>%
  group_by(Duplicate_Set, code) %>%
  mutate(comment = n())

#  id    serial code  Duplicate_Set comment
#  <fct> <fct>  <fct>         <int>   <int>
#1 B19   a      IA284             1       1
#2 B19   a      IA114             1       2
#3 B19   a      IA114             1       2
#4 B20   b      IB290             2       2
#5 B20   b      IB485             2       1
#6 B20   b      IB290             2       2

A shorter version suggested by @tmfmnk

df %>% 
  group_by(Duplicate_Set = group_indices(., id, serial)) %>% 
  add_count(code, name = "comment")

Upvotes: 1

Related Questions