TarJae

Reputation: 79194

Direct way for 'distinct() groupwise'

I would like to do the same as distinct(), but for whole groups: keep only the first group for each distinct set of procedures. Here is an example:

data <- data.frame(
  group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)

  group procedure
1     1         A
2     1         B
3     2         A
4     3         A
5     3         B
6     4         A
7     4         X
8     5         A
9     5         X

I am expecting this:

Note: group_id is just an intermediate column and is not important:

  group procedure group_id
  <dbl> <chr>        <int>
1     1 A                2
2     1 B                2
3     2 A                1
4     4 A                3
5     4 X                3

I use this working code:

library(dplyr)
library(tidyr)

data %>%
  summarise(procedure = toString(sort(procedure)), .by = group) %>%
  mutate(group_id = as.integer(factor(procedure))) %>% 
  distinct(group_id, .keep_all = TRUE) %>% 
  separate_rows(procedure)

Is there a more direct method available? For context, my dataset contains 23,000 rows with numerous groups, and I need to identify and evaluate the main member of each group. Therefore, I'm looking for a way to efficiently distinguish and assess all unique groups. Could you suggest an approach to facilitate this evaluation?

Upvotes: 4

Views: 88

Answers (2)

jay.sf

Reputation: 73572

We can cross-tabulate procedure by group and keep only the groups whose row in that table does not duplicate an earlier row.

> subset(data1, group %in% rownames(unique(unclass(table(group, procedure)))))
  group procedure
1     1         A
2     1         B
3     2         A
6     4         A
7     4         X
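
To see why this works, it may help to look at the intermediate contingency table step by step. The sketch below rebuilds the question's data under the name data1 (an assumption about what data1 holds, based on the output above); tab and keep are hypothetical helper names used only for illustration.

```r
# Question's example data (assumed to be what data1 refers to).
data1 <- data.frame(
  group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)

# One row per group, one column per procedure; each cell counts occurrences.
# unclass() turns the table into a plain integer matrix with dimnames.
tab <- unclass(table(data1$group, data1$procedure))

# unique() on a matrix drops rows that repeat an earlier row,
# so the surviving row names are the groups to keep.
keep <- rownames(unique(tab))

# %in% coerces the numeric group column to character for the comparison.
result <- data1[data1$group %in% keep, ]
```

Because the cells are counts, a group with procedures A, A, B would not be treated as a duplicate of a group with A, B.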

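We can generalize this into a function that takes unquoted column names.

> distinct_groups <- function(data, ..., .by) {
+   # Capture the unquoted column names; rev() moves the .by column to the front.
+   g <- rev(sapply(match.call()[-(1:2)], deparse))
+   # Keep rows whose group has a non-duplicated row in the contingency table.
+   data[data[[g[1]]] %in% rownames(unique(unclass(table(data[g])))), ]
+ }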
> data2
   group procedure foo
1    1.0         A   1
2    1.0         B   1
3    2.0         A   1
4    3.0         A   1
5    3.0         B   1
6    3.1         A   2
7    3.1         B   2
8    4.0         A   1
9    4.0         X   1
10   5.0         A   1
11   5.0         X   1
> data2 |> distinct_groups(procedure, foo, .by=group)
  group procedure foo
1   1.0         A   1
2   1.0         B   1
3   2.0         A   1
6   3.1         A   2
7   3.1         B   2
8   4.0         A   1
9   4.0         X   1
> data1 |> distinct_groups(procedure, .by=group)
  group procedure
1     1         A
2     1         B
3     2         A
6     4         A
7     4         X

Data:

> dput(data1)
structure(list(group = c(1, 1, 2, 3, 3, 4, 4, 5, 5), procedure = c("A", 
"B", "A", "A", "B", "A", "X", "A", "X")), class = "data.frame", row.names = c(NA, 
-9L))
> dput(data2)
structure(list(group = c(1, 1, 2, 3, 3, 3.1, 3.1, 4, 4, 5, 5), 
    procedure = c("A", "B", "A", "A", "B", "A", "B", "A", "X", 
    "A", "X"), foo = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 
    1L)), class = "data.frame", row.names = c(NA, -11L))

Upvotes: 3

ThomasIsCoding

Reputation: 102529

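I don't know if this is short enough for you, but you could collapse each group into a sorted list and drop the duplicates:

data %>%
    # Collapse each group to a sorted list of its procedures.
    summarise(procedure = list(sort(procedure)), .by = group) %>%
    # Keep only the first group for each distinct procedure set.
    filter(!duplicated(procedure)) %>%
    unnest(procedure)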

which gives

# A tibble: 5 × 2
  group procedure
  <dbl> <chr>
1     1 A
2     1 B
3     2 A
4     4 A
5     4 X
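
The pipeline rests on two things: sort() makes the comparison independent of row order within a group, and duplicated() compares list elements as wholes. Below is a rough base R sketch of the same idea, in case avoiding the summarise/unnest round trip is useful. The names sets and keep are hypothetical and not part of the answer above.

```r
# Question's example data.
data <- data.frame(
  group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)

# Sorted procedure vector per group, as a list named by group.
sets <- lapply(split(data$procedure, data$group), sort)

# duplicated() on a list flags elements identical to an earlier element.
keep <- names(sets)[!duplicated(sets)]

# Filter the original rows; %in% compares the group keys as character.
result <- data[data$group %in% keep, ]
```

This keeps the original row names and row order, at the cost of comparing group keys as character strings.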

Upvotes: 3
