Remove duplicates, prioritising which rows to remove based on another column, in R

Question

temp <- data.frame("Id" = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6), "country" = c("a", "f", "b", "e", "d", "b", "f", "a", "e", "a", "a", "b", "d"))

Say I have the data frame above. I want exactly one row per Id, prioritizing rows whose country is a, b or c. Some Ids, such as 1, have rows with both a and b. In that case I don't care which of those rows is kept (either is fine), but I don't want the row with country f. For Id 2, no row has a country in a, b or c, so either of its rows will do. For Id 3, I just want the row with country b.

The final data frame I want therefore looks like this:

temp2 <- data.frame("Id" = c(1,2,3,4,5,6), "country" = c("a","e", "b", "a", "a", "a"))

Is there a neat way to do this?

Ronak Shah · Accepted Answer

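We can write a function that checks whether any of the countries of interest are present in a group and returns a row number accordingly: a random row among the matching ones if there are any, otherwise a random row from the whole group.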

interested <- c('a', 'b', 'c')

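get_rows <- function(ctry, interested) {
   # Positions of rows whose country is one of the countries of interest
   idx <- which(ctry %in% interested)
   # No match in this group: fall back to all of its rows
   if (length(idx) == 0) idx <- seq_along(ctry)
   # Index into idx instead of calling sample(idx, 1), which would draw from 1:idx when idx is a single number
   idx[sample.int(length(idx), 1)]
}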
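
A caveat worth knowing when sampling row positions like this: if sample() is given a single number n >= 1, it draws from 1:n rather than returning n. The sketch below illustrates this on a two-row group where only the second row matches. pick_one() is a hypothetical helper name introduced here for illustration; it is not part of base R or dplyr.

```r
# pick_one() is a hypothetical helper: choose one element of x by position.
pick_one <- function(x) x[sample.int(length(x), 1)]

# A group where only the second row has a country of interest.
ctry <- c("f", "a")
hits <- which(ctry %in% c("a", "b", "c"))  # a single index: 2

# sample(2, 1) draws from 1:2, so it can return row 1 (country "f").
unsafe <- replicate(200, sample(hits, 1))
# Indexing into the candidate vector can only ever return 2.
safe <- replicate(200, pick_one(hits))

table(unsafe)
table(safe)
```

With the data in the question the matching row always happens to be the first one whenever there is only a single match, so the problem does not show up there, but it would for a group such as the one above.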

Then apply it to each Id group. The row choice is random, so set.seed() is only there to make the result reproducible:

library(dplyr)
set.seed(123)
temp %>% group_by(Id) %>% slice(get_rows(country, interested))

#     Id country
#1     1 a      
#2     2 e      
#3     3 b      
#4     4 a      
#5     5 a      
#6     6 b
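
If a random pick is not needed, a deterministic variant may be simpler. This is a different technique from the sampling above, sketched in base R under the assumption that "first matching row in the original order" is an acceptable tie-break: sort so that rows with a country of interest come first within each Id, then drop the later duplicates of each Id. The names ord, sorted and result are just illustrative.

```r
temp <- data.frame(
  Id = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6),
  country = c("a", "f", "b", "e", "d", "b", "f", "a", "e", "a", "a", "b", "d")
)
interested <- c("a", "b", "c")

# Sort by Id, then by "not of interest" (FALSE sorts before TRUE), so rows
# with a country of interest come first within each Id. order() is stable,
# so ties keep their original relative order.
ord <- order(temp$Id, !(temp$country %in% interested))
sorted <- temp[ord, ]

# Keep only the first row for each Id.
result <- sorted[!duplicated(sorted$Id), ]
rownames(result) <- NULL

result
```

On the example data this keeps countries a, e, b, a, a, a for Ids 1 to 6, which is the temp2 shown in the question.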
