Removing duplicates based on a specific category of another column

Question

I would like to remove duplicate IDs in my data based on the category column. A subset of my data is as follows:

df <- data.frame(ID=c(1,2,3,4,1,4,2),
                 category=c("a","b","c","d","b","a","a"))
df

  ID category
1  1        a
2  2        b
3  3        c
4  4        d
5  1        b
6  4        a
7  2        a

If a duplicated ID appears in category b, I need to keep that row and remove the rows with the same ID from the other categories. I have no preference when the duplicated IDs come only from categories other than b. So, my desired outcome is:

  ID category
1  2        b
2  3        c
3  4        d
4  1        b

I have already read this post: R: Remove duplicates from a dataframe based on categories in a column, but I couldn't find my answer there.

akrun · Accepted Answer

We could use arrange so that the 'b' category rows are placed at the top, and then get the distinct rows by 'ID'

library(dplyr)
df %>%
     arrange(category != 'b') %>% 
     distinct(ID, .keep_all = TRUE)

-output

  ID category
1  2        b
2  1        b
3  3        c
4  4        d

Or using base R

df1 <- df[order(df$category != 'b'), ]
df1[!duplicated(df1$ID), ]
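
To see why sorting on category != 'b' puts the 'b' rows first, here is a minimal base R sketch using the data from the question. The comparison gives FALSE for 'b' rows and TRUE for everything else, and FALSE sorts before TRUE. The names key and result are only illustrative, and the sketch assumes order() leaves tied rows in their original order, which is what its documentation describes.

```r
# Example data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 1, 4, 2),
                 category = c("a", "b", "c", "d", "b", "a", "a"))

# Logical sort key: FALSE for 'b' rows, TRUE for all other rows
key <- df$category != "b"

# FALSE sorts before TRUE, so the 'b' rows (rows 2 and 5) come first
order(key)
# 2 5 1 3 4 6 7

# Reorder, then keep the first occurrence of each ID
df1 <- df[order(key), ]
result <- df1[!duplicated(df1$ID), ]

# Sanity check: every ID that has a 'b' row keeps that row
ids_with_b <- unique(df$ID[df$category == "b"])
all(result$category[result$ID %in% ids_with_b] == "b")
# TRUE
```

For IDs that never appear in category 'b' (here ID 4), the row that survives is simply the one that occurs first in the original data.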
