Reputation: 31
temp <- data.frame("Id" = c(1,1,1,2,2,3,3,4,4,5,6,6,6), "country" = c("a", "f", "b","e", "d", "b", "f", "a", 'e',"a", "a","b","d"))
Say I have the data frame above. I want exactly one row per Id, prioritizing rows whose country is a, b or c. Id 1, for example, has rows with countries a and b. In that case it doesn't matter which of the two is retrieved (either is fine), but the row with country f must not be chosen. For Id 2, none of the countries is a, b or c, so either of its rows will do. For Id 3, we just want the row with country b.
The final data frame I want therefore looks like this:
temp2 <- data.frame("Id" = c(1,2,3,4,5,6), "country" = c("a","e", "b", "a", "a", "a"))
Is there a neat way to do this?
Upvotes: 1
Views: 67
Reputation: 388992
We can write a function that checks each group for the countries of interest and returns the position of one matching row if there is any, or of a random row otherwise.
interested <- c('a', 'b', 'c')
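get_rows <- function(ctry, interested) {
  # Row positions within the group whose country is one of the preferred ones
  idx <- which(ctry %in% interested)
  # No preferred country in this group: any row is acceptable
  if (length(idx) == 0) idx <- seq_along(ctry)
  # Index into idx instead of sample(idx, 1), which reads a single number n as 1:n
  idx[sample.int(length(idx), 1)]
}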
Then apply it by group:
library(dplyr)
set.seed(123)
temp %>% group_by(Id) %>% slice(get_rows(country, interested))
# Id country
# <dbl> <fct>
#1 1 a
#2 2 e
#3 3 b
#4 4 a
#5 5 a
#6 6 b
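If a reproducible result is preferred over a random pick, a deterministic variant is possible in base R. The following is only a sketch under a few assumptions: the data are the `temp` frame from the question, "first preferred row in the original order" is an acceptable tie-break, and the names `ord`, `sorted` and `result` are purely illustrative. It relies on `order()` leaving ties in their original order.

```r
temp <- data.frame(
  Id = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6),
  country = c("a", "f", "b", "e", "d", "b", "f", "a", "e", "a", "a", "b", "d")
)
interested <- c("a", "b", "c")

# Sort by Id, then put preferred countries first within each Id.
# FALSE sorts before TRUE, so negating the membership test moves matches to the top.
ord <- order(temp$Id, !(temp$country %in% interested))
sorted <- temp[ord, ]

# Keep only the first row seen for each Id.
result <- sorted[!duplicated(sorted$Id), ]
rownames(result) <- NULL
result
```

For this data the sketch picks a, e, b, a, a, a for Ids 1 to 6, which matches the `temp2` frame in the question. Ids without any preferred country simply keep their first row.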
Upvotes: 1