Elis
Elis

Reputation: 54

Which groups have exactly the same rows

If I have a data frame like the following

group1 group2 col1 col2
A 1 ABC 5
A 1 DEF 2
B 1 AB 1
C 1 ABC 5
C 1 DEF 2
A 2 BC 8
B 2 AB 1

We can see that the the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within this group). The same is true for (B,1) and (B, 2).

So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:

category group1 group2
1 A 1
1 C 1
2 B 1
2 B 2
3 A 2

And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.

I have tried a few very long/inefficient ways of doing this in Dplyr but I'm sure there must be a more efficient way to do this. Thanks

Upvotes: 1

Views: 145

Answers (2)

Maël
Maël

Reputation: 52329

You can use pivot_wider first to handle identical groups over multiple rows.

library(tidyverse)

df %>% 
  group_by(group1, group2) %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = id, values_from = c(col1, col2)) %>% 
  group_by(across(-c(group1, group2))) %>% 
  mutate(category = cur_group_id()) %>% 
  ungroup() %>% 
  select(category, group1, group2) %>% 
  arrange(category)
  category group1 group2
     <int> <chr>   <int>
1        1 B           1
2        1 B           2
3        2 A           1
4        2 C           1
5        3 A           2

Upvotes: 2

Quinten
Quinten

Reputation: 41533

You could first group_by "col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id like this:

library(dplyr)
library(tidyr)
df %>%
  group_by(col1, col2) %>%
  filter(n() != 1) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup() %>%
  select(-starts_with("col"))
#> # A tibble: 6 × 3
#>   group1 group2    ID
#>   <chr>   <int> <int>
#> 1 A           1     2
#> 2 A           1     3
#> 3 B           1     1
#> 4 C           1     2
#> 5 C           1     3
#> 6 B           2     1

Created on 2022-08-12 by the reprex package (v2.0.1)

Upvotes: 1

Related Questions