Reputation: 273
I am searching for a simpler solution to the following problem. Here is my setup:
test <- tibble::tribble(
~group_name, ~id_name, ~varA, ~varB,
"groupA", "id_1", 1, "a",
"groupA", "id_2", 4, "f",
"groupA", "id_3", 5, "g",
"groupA", "id_4", 6, "x",
"groupA", "id_4", 6, "h",
"groupB", "id_1", 2, "s",
"groupB", "id_2", 13, "y",
"groupB", "id_4", 14, "t",
"groupC", "id_1", 3, "d",
"groupC", "id_2", 7, "j",
"groupC", "id_3", 8, "k",
"groupC", "id_4", 9, "l",
"groupC", "id_5", 0, "o",
"groupC", "id_6", 12, "u"
)
I want to select only those elements in id_name
that are common to all groups in group_name
- i.e. drop the rows for ids that are not present in all the groups. My actual data is large (200k rows) with anywhere between 4-20 groups (I don't know the number of groups beforehand so the solution must work for any number of groups). The id_name
in each group is NOT unique. The desired result would be:
test_result <- tibble::tribble(
~group_name, ~id_name, ~varA, ~varB,
"groupA", "id_1", 1, "a",
"groupA", "id_2", 4, "f",
"groupA", "id_4", 6, "x",
"groupA", "id_4", 6, "h",
"groupB", "id_1", 2, "s",
"groupB", "id_2", 13, "y",
"groupB", "id_4", 14, "t",
"groupC", "id_1", 3, "d",
"groupC", "id_2", 7, "j",
"groupC", "id_4", 9, "l",
)
(the rows with ids absent in at least one group are dropped). Ideally I do not want my output to have the columns joined at the end. I want "simply" to drop the rows missing in any one group but maintain the shape of the dataframe.
And I know that I can extract all the ids from each group, then intersect them all to obtain the list of unique ids present in all groups and then filter the main dataframe for just these IDs. But that sounds like a lot of work ;-)
Any hints would be very much appreciated.
Upvotes: 1
Views: 763
Reputation: 46908
You can tabulate the the occurrence of id_name by group_name:
table(test$group_name,test$id_name)
If id_name is present in every group, we want columns that have all > 0 entris. We can simplify this logic using a combination of >0 and colMeans:
keep = names(which(colMeans(table(test$group_name,test$id_name)>0)==1))
Using this:
test[test$id_name %in% keep,]
Upvotes: 0
Reputation: 389047
In base R, we can split
id_name
by group_name
find common id's
and then subset
subset(test, id_name %in% Reduce(intersect, split(id_name, group_name)))
# group_name id_name varA varB
# <chr> <chr> <dbl> <chr>
# 1 groupA id_1 1 a
# 2 groupA id_2 4 f
# 3 groupA id_4 6 x
# 4 groupA id_4 6 h
# 5 groupB id_1 2 s
# 6 groupB id_2 13 y
# 7 groupB id_4 14 t
# 8 groupC id_1 3 d
# 9 groupC id_2 7 j
#10 groupC id_4 9 l
Using similar concept in tidyverse
, it would be
library(tidyverse)
test %>%
filter(id_name %in% (test %>%
group_split(group_name) %>%
map(~pull(., id_name)) %>%
reduce(intersect)))
Upvotes: 1