Reputation: 3432
Hello I have a df such as
Groups COL1
G1 horse
G1 donkey
G1 unknown
G1 snake
G1 horse
G2 dog
G2 dog
G2 unknown
G2 unknown
G3 donkey
G3 dog
G4 Mule
G4 dog
G4 cat
G4 cat
G5 mule
G5 donkey
G5 mule
and a list
list_not_accepted=c("horse","donkey","mule")
so basically the idea would be to select only groups where the number of COL1
element present in list_not_accepted
/
number of all COL1
value <0.6
.
so here :
G1 3/5 = 0.6
so G1 does not pass
G2 0/5 = 0
so G2 passes
G3 1/2= 0.5
so G3 passes
G4 1/4 = 0.25
so G4 passes
G5 3/3 =1 so G5
does not passe
at the end we should get a df such as :
Groups COL1
G2 dog
G2 dog
G2 unknown
G2 unknown
G3 donkey
G3 dog
G4 Mule
G4 dog
G4 cat
G4 cat
here are the data
> dput(tabl)
structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G2",
"G2", "G2", "G2", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5",
"G5"), COL1 = c("horse", "donkey", "unknown", "snake", "horse",
"dog", "dog", "unknown", "unknown", "donkey", "dog", "Mule",
"dog", "cat", "cat", "mule", "donkey", "mule")), row.names = c(NA,
-18L), class = "data.frame")
Does someone have an idea please ? thank you very much !
Upvotes: 2
Views: 44
Reputation: 35554
A dplyr
solution with filter()
:
library(dplyr)
df %>%
group_by(Groups) %>%
filter(sum(tolower(COL1) %in% list_not_accepted) / n() < 0.6)
# A tibble: 10 x 2
# Groups: Groups [3]
# Groups COL1
# <chr> <chr>
# 1 G2 dog
# 2 G2 dog
# 3 G2 unknown
# 4 G2 unknown
# 5 G3 donkey
# 6 G3 dog
# 7 G4 Mule
# 8 G4 dog
# 9 G4 cat
# 10 G4 cat
The first element in G4
is "Mule"
. In your description it should match "mule"
of list_not_accepted
, so I turn all COL1
to lower cases before matching.
Upvotes: 2
Reputation: 39595
I would suggest next approach where you can join the original data with small dataframes where you can compute the ratios and make the filter. We have used an inner_join()
to add the results from filtered groups according to threshold. We also use left_join()
in order to merge the groups with the number of cases in COL1
and the number of cases after applying the filter with list_not_accepted
. With those results, ratio variable is computed and the filter is applied. Finally, the expected output is reached. Here the code:
library(tidyverse)
#Data
list_not_accepted=c("horse","donkey","mule")
#Compute
#Join data
mydf %>%
inner_join(mydf %>%
#Compute n obs across all groups
group_by(Groups) %>%
summarise(N=n()) %>%
#Lef join with n obs based on vector list_not_accepted
left_join(mydf %>%
filter(COL1 %in% list_not_accepted) %>%
group_by(Groups) %>%
summarise(N1=n())) %>% replace(is.na(.),0) %>%
#Compute ratio
mutate(Ratio=N1/N) %>%
filter(Ratio<0.6)) %>% select(Groups,COL1)
Output:
Groups COL1
1 G2 dog
2 G2 dog
3 G2 unknown
4 G2 unknown
5 G3 donkey
6 G3 dog
7 G4 Mule
8 G4 dog
9 G4 cat
10 G4 cat
Some data used:
#Data
mydf <- structure(list(Groups = c("G1", "G1", "G1", "G1", "G1", "G2",
"G2", "G2", "G2", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5",
"G5"), COL1 = c("horse", "donkey", "unknown", "snake", "horse",
"dog", "dog", "unknown", "unknown", "donkey", "dog", "Mule",
"dog", "cat", "cat", "mule", "donkey", "mule")), class = "data.frame", row.names = c(NA,
-18L))
Upvotes: 1