Reputation: 1950
I run into this problem from time to time. I want to collapse my data frame so that one column returns the largest category within each group, where "largest" means the category with the highest total val, even when a category has multiple observations. Example:
library(dplyr)
df <- tibble(grp = c(1, 1, 1, 1, 2, 2, 2, 2),
             cat = c("A", "B", "B", "A", "C", "D", "C", "C"),
             val = c(1, 2, 1, 4, 1, 8, 2, 1))
# A tibble: 8 x 3
#     grp cat     val
#   <dbl> <chr> <dbl>
# 1     1 A         1
# 2     1 B         2
# 3     1 B         1
# 4     1 A         4
# 5     2 C         1
# 6     2 D         8
# 7     2 C         2
# 8     2 C         1
Expected output:
# A tibble: 2 x 3
    grp   val biggest_cat
  <dbl> <dbl> <chr>
1     1     8 A
2     2    12 D
Note that for group 2 I want cat D to be returned, since the sum of val for D (8) is larger than the sum for cat C (4).
This works:
df %>%
  group_by(grp, cat) %>%
  summarise(val = sum(val)) %>%
  group_by(grp) %>%
  # compute biggest_cat first: once val = sum(val) runs, val refers to the group total
  summarise(biggest_cat = first(cat, order_by = -val),
            val = sum(val))
But I want to do it without the double summarise:
df %>%
  group_by(grp) %>%
  summarise(val = sum(val),
            biggest_cat = <Some function>)
Maybe there is a forcats solution or something?
Thanks! :)
Upvotes: 0
Views: 54
Reputation: 389047
We could group_by cat and grp to calculate the sum, then select the row with the max value of that sum in each grp.
library(dplyr)
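df %>%
  group_by(grp, cat) %>%
  summarise(val = sum(val)) %>%
  # the result is still grouped by grp, so this summarise runs once per grp;
  # pick the category before val is replaced by the group total
  summarise(biggest_cat = cat[which.max(val)],
            val = sum(val))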
To do it using a single summarise, we can use tapply:
df %>%
  group_by(grp) %>%
  summarise(total_val = sum(val),
            biggest_cat = names(which.max(tapply(val, cat, sum))))
# grp total_val biggest_cat
# <dbl> <dbl> <chr>
#1 1 8 A
#2 2 12 D
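To see what the tapply call does inside summarise, here is a minimal base R sketch of the same steps for grp 2 only. The vector names val_g2 and cat_g2 are made up for illustration; the values are copied from the question's data.

```r
# Values and categories for grp == 2, copied from the question's data
# (val_g2 and cat_g2 are illustrative names, not part of the original code)
val_g2 <- c(1, 8, 2, 1)
cat_g2 <- c("C", "D", "C", "C")

# Sum val within each category; the result is named by category
sums <- tapply(val_g2, cat_g2, sum)

# which.max() gives the position of the largest sum and keeps its name,
# so names() recovers the category label
biggest_cat <- names(which.max(sums))
print(sums)
print(biggest_cat)
```

Note that which.max returns the first maximum, so if two categories tie on the total, this approach would return whichever comes first in tapply's sorted category order.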
Upvotes: 1