chippycentra
chippycentra

Reputation: 3432

Using dplyr to select groups depending in ratio of item present into a list

Hello I have a df such as

Groups COL1
G1 horse
G1 donkey
G1 unknown
G1 snake 
G1 horse 
G2 dog
G2 dog
G2 unknown
G2 unknown
G3 donkey
G3 dog
G4 Mule
G4 dog
G4 cat 
G4 cat
G5 mule
G5 donkey
G5 mule

and a list

list_not_accepted=c("horse","donkey","mule")

so basically the idea would be to select only groups where the number of COL1 element present in list_not_accepted / number of all COL1 value <0.6.

so here :

at the end we should get a df such as :

Groups COL1
G2 dog
G2 dog
G2 unknown
G2 unknown
G3 donkey
G3 dog
G4 Mule
G4 dog
G4 cat 
G4 cat

here are the data

> dput(tabl)
structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G2", 
"G2", "G2", "G2", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5", 
"G5"), COL1 = c("horse", "donkey", "unknown", "snake", "horse", 
"dog", "dog", "unknown", "unknown", "donkey", "dog", "Mule", 
"dog", "cat", "cat", "mule", "donkey", "mule")), row.names = c(NA, 
-18L), class = "data.frame")

Does someone have an idea please ? thank you very much !

Upvotes: 2

Views: 44

Answers (2)

Darren Tsai
Darren Tsai

Reputation: 35554

A dplyr solution with filter():

library(dplyr)

df %>%
  group_by(Groups) %>%
  filter(sum(tolower(COL1) %in% list_not_accepted) / n() < 0.6)

# A tibble: 10 x 2
# Groups:   Groups [3]
#    Groups COL1   
#    <chr>  <chr>  
#  1 G2     dog    
#  2 G2     dog    
#  3 G2     unknown
#  4 G2     unknown
#  5 G3     donkey 
#  6 G3     dog    
#  7 G4     Mule   
#  8 G4     dog    
#  9 G4     cat    
# 10 G4     cat

The first element in G4 is "Mule". In your description it should match "mule" of list_not_accepted, so I turn all COL1 to lower cases before matching.

Upvotes: 2

Duck
Duck

Reputation: 39595

I would suggest next approach where you can join the original data with small dataframes where you can compute the ratios and make the filter. We have used an inner_join() to add the results from filtered groups according to threshold. We also use left_join() in order to merge the groups with the number of cases in COL1 and the number of cases after applying the filter with list_not_accepted. With those results, ratio variable is computed and the filter is applied. Finally, the expected output is reached. Here the code:

library(tidyverse)
#Data
list_not_accepted=c("horse","donkey","mule")
#Compute
#Join data
mydf %>% 
  inner_join(mydf %>%
              #Compute n obs across all groups
              group_by(Groups) %>%
              summarise(N=n()) %>% 
              #Lef join with n obs based on vector list_not_accepted
              left_join(mydf %>% 
                          filter(COL1 %in% list_not_accepted) %>%
                          group_by(Groups) %>%
                          summarise(N1=n())) %>% replace(is.na(.),0) %>%
              #Compute ratio
              mutate(Ratio=N1/N) %>%
              filter(Ratio<0.6)) %>% select(Groups,COL1)

Output:

   Groups    COL1
1      G2     dog
2      G2     dog
3      G2 unknown
4      G2 unknown
5      G3  donkey
6      G3     dog
7      G4    Mule
8      G4     dog
9      G4     cat
10     G4     cat

Some data used:

#Data
mydf <- structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G2", 
"G2", "G2", "G2", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5", 
"G5"), COL1 = c("horse", "donkey", "unknown", "snake", "horse", 
"dog", "dog", "unknown", "unknown", "donkey", "dog", "Mule", 
"dog", "cat", "cat", "mule", "donkey", "mule")), class = "data.frame", row.names = c(NA, 
-18L))

Upvotes: 1

Related Questions