Reputation: 79194
I would like to do the same thing as distinct(),
but for whole groups rather than single rows. Here is an example:
data <- data.frame(
group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)
group procedure
1 1 A
2 1 B
3 2 A
4 3 A
5 3 B
6 4 A
7 4 X
8 5 A
9 5 X
I am expecting the output below.
Note: group_id
is just an interim column and is not important:
group procedure group_id
<dbl> <chr> <int>
1 1 A 2
2 1 B 2
3 2 A 1
4 4 A 3
5 4 X 3
I use this working code:
library(dplyr)
library(tidyr)
data %>%
summarise(procedure = toString(sort(procedure)), .by = group) %>%
mutate(group_id = as.integer(factor(procedure))) %>%
distinct(group_id, .keep_all = TRUE) %>%
separate_rows(procedure)
Is there a more direct method available? For context, my dataset contains 23,000 rows with numerous groups, and I need to identify and evaluate the main member of each group. Therefore, I'm looking for a way to efficiently distinguish and assess all unique groups. Could you suggest an approach to facilitate this evaluation?
Upvotes: 4
Views: 88
Reputation: 73572
We can cross-tabulate procedure by group with table()
and then subset
to the groups whose row of counts does not duplicate an earlier group's.
> subset(data1, group %in% rownames(unique(unclass(table(group, procedure)))))
group procedure
1 1 A
2 1 B
3 2 A
6 4 A
7 4 X
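To make the one-liner above easier to follow, here is a minimal sketch of its intermediate steps. It assumes data1 is the question's example data frame; the names tab, keep and result are only illustrative.

```r
# The question's example data, named data1 to match the call above
data1 <- data.frame(
  group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)

# Contingency table as a plain matrix: one row per group,
# one column per procedure, cells are counts
tab <- unclass(table(data1$group, data1$procedure))
tab

# unique() on a matrix drops duplicated rows and keeps the row name
# of the first group showing each pattern
keep <- rownames(unique(tab))
keep

# Keep only the rows that belong to a surviving group
# (%in% coerces the numeric group values to character for matching)
result <- data1[data1$group %in% keep, ]
result
```

Groups 3 and 5 produce the same row of counts as groups 1 and 4 respectively, so only groups 1, 2 and 4 remain.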
We could generalize this.
> distinct_groups <- function(data, ..., .by) {
+ g <- rev(sapply(match.call()[-(1:2)], deparse))
+ data[data[[g[1]]] %in% rownames(unique(unclass(table(data[g])))), ]
+ }
> data2
group procedure foo
1 1.0 A 1
2 1.0 B 1
3 2.0 A 1
4 3.0 A 1
5 3.0 B 1
6 3.1 A 2
7 3.1 B 2
8 4.0 A 1
9 4.0 X 1
10 5.0 A 1
11 5.0 X 1
> data2 |> distinct_groups(procedure, foo, .by=group)
group procedure foo
1 1.0 A 1
2 1.0 B 1
3 2.0 A 1
6 3.1 A 2
7 3.1 B 2
8 4.0 A 1
9 4.0 X 1
> data1 |> distinct_groups(procedure, .by=group)
group procedure
1 1 A
2 1 B
3 2 A
6 4 A
7 4 X
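If capturing unquoted arguments with match.call() feels too magical, the following is a rough alternative sketch that takes column names as strings instead. The function name distinct_groups_chr and its arguments cols and by are hypothetical, and the sketch assumes that no value contains the "|" separator or a newline. It builds one text signature per group instead of a table, so it is a swapped-in technique rather than the function above.

```r
# Hypothetical helper: keep only the first group for each distinct
# set of rows, with columns given as character vectors
distinct_groups_chr <- function(data, cols, by) {
  # One key per row, built from the columns that define group content
  # (assumes "|" does not occur in the values)
  key <- do.call(paste, c(data[cols], sep = "|"))
  # One signature per group: its sorted row keys glued together
  sig <- vapply(
    split(key, data[[by]]),
    function(k) paste(sort(k), collapse = "\n"),
    character(1)
  )
  # Groups whose signature has not been seen before
  keep <- names(sig)[!duplicated(sig)]
  data[as.character(data[[by]]) %in% keep, , drop = FALSE]
}

# Same content as data2 in this answer
data2 <- data.frame(
  group = c(1, 1, 2, 3, 3, 3.1, 3.1, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "B", "A", "X", "A", "X"),
  foo = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L)
)

res2 <- distinct_groups_chr(data2, cols = c("procedure", "foo"), by = "group")
res2
```

On data2 this should keep groups 1, 2, 3.1 and 4, the same groups as the distinct_groups() call above.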
Data:
> dput(data1)
structure(list(group = c(1, 1, 2, 3, 3, 4, 4, 5, 5), procedure = c("A",
"B", "A", "A", "B", "A", "X", "A", "X")), class = "data.frame", row.names = c(NA,
-9L))
> dput(data2)
structure(list(group = c(1, 1, 2, 3, 3, 3.1, 3.1, 4, 4, 5, 5),
procedure = c("A", "B", "A", "A", "B", "A", "B", "A", "X",
"A", "X"), foo = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
1L)), class = "data.frame", row.names = c(NA, -11L))
Upvotes: 3
Reputation: 102529
I don't know if this code is short enough for you, but here is one option:
data %>%
summarise(procedure = list(sort(procedure)), .by = group) %>%
filter(!duplicated(procedure)) %>%
unnest(procedure)
which gives
# A tibble: 5 × 2
group procedure
<dbl> <chr>
1 1 A
2 1 B
3 2 A
4 4 A
5 4 X
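The pipeline above relies on duplicated() comparing the elements of a list column. As a rough illustration of that idea without dplyr or tidyr, here is a base R sketch on the question's data; the names sets, dup, keep and out are only illustrative.

```r
data <- data.frame(
  group = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  procedure = c("A", "B", "A", "A", "B", "A", "X", "A", "X")
)

# One sorted procedure vector per group, stored as a named list
sets <- lapply(split(data$procedure, data$group), sort)

# duplicated() compares list elements as whole vectors, so a group
# is flagged when an earlier group has exactly the same procedures
dup <- duplicated(sets)

# Names of the groups that are not repeats of an earlier one
keep <- names(sets)[!dup]

out <- data[data$group %in% keep, ]
out
```

Sorting inside each group matters: without it, two groups holding the same procedures in a different row order would not be recognised as duplicates.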
Upvotes: 3