AnilGoyal
AnilGoyal

Reputation: 26218

how to intersect elements of one column along with group_by in R

Lets assume, my data is like

  group_id    col1
1        1     A,B
2        1     B,C
3        2     A,C
4        2     B,D
5        3     A,D
6        3 A,B,C,D

I would like to summarise/mutate col1 where its elements are intersected within same group (over group_id). My required output is like (if summarising)

  group_id col1
1        1    B
2        2 <NA>
3        3  A,D

or like this (if mutating)

  group_id col1
1        1    B
2        1    B
3        2 <NA>
4        2 <NA>
5        3  A,D
6        3  A,D

I can easily create a union by using function toString but scratching my head for how to have common elements in the output. Basically intersect requires at least two arguments and is therefore not working here.

dput(df) is as under

df <-  structure(list(group_id = c(1L, 1L, 2L, 2L, 3L, 3L), col1 = c("A,B", 
"B,C", "A,C", "B,D", "A,D", "A,B,C,D")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Upvotes: 2

Views: 864

Answers (3)

tmfmnk
tmfmnk

Reputation: 39858

One option using dplyr and tidyr could be:

df %>%
 separate_rows(col1) %>%
 count(group_id, col1) %>%
 group_by(group_id) %>%
 summarise(col1 = if_else(all(n == 1), NA_character_, paste(col1[n == 2], collapse = ",")))

  group_id col1 
     <int> <chr>
1        1 B    
2        2 <NA> 
3        3 A,D  

Upvotes: 1

Karthik S
Karthik S

Reputation: 11584

Would this work:

library(dplyr)
library(tidyr)
df %>% separate_rows(col1) %>% 
   group_by(group_id, col1) %>% filter(n()>1) %>% 
   distinct() %>% group_by(group_id) %>% summarise(col1 = toString(col1)) %>% 
   right_join(df %>% select(group_id) %>% distinct()) %>% 
   arrange(group_id)
`summarise()` ungrouping output (override with `.groups` argument)
Joining, by = "group_id"
# A tibble: 3 x 2
  group_id col1 
     <int> <chr>
1        1 B    
2        2 NA   
3        3 A, D 

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

You can split col1 on comma and use Reduce + intersect to get common values in each group_id.

library(dplyr)
df %>%
  group_by(group_id) %>%
  summarise(col1 = toString(Reduce(intersect, strsplit(col1, ','))))

#  group_id col1  
#*    <int> <chr> 
#1        1 "B"   
#2        2 ""    
#3        3 "A, D"

Upvotes: 1

Related Questions