Reputation: 1706
I am trying to come up with a sum for each task in a dataset that only uses the largest value observed for the id once in the sum. If that's not clear I've provided an example of the desired output below.
Sample Data
dat <- data.frame(task = rep(LETTERS[1:3], each=3),
id = c(rep(1:2, 4) , 3),
value = c(rep(c(10,20), 4), 5))
dat
task id value
1 A 1 10
2 A 2 20
3 A 1 10
4 B 2 20
5 B 1 10
6 B 2 20
7 C 1 10
8 C 2 20
9 C 3 5
I've found an answer that works, but it requires two separate group_by()
functions. Is there a way to get the same output with a single group_by()
? The reason is I have other summarized metrics that are sensitive to the grouping and I can't run two different group_by
functions in the same pipeline.
dat %>%
group_by(task, id) %>%
summarize(v = max(value)) %>%
group_by(task) %>%
summarize(unique_ids = n_distinct(id),
value_sum = sum(v))
# A tibble: 3 × 3
task unique_ids value_sum
<chr> <int> <dbl>
1 A 2 30
2 B 2 30
3 C 3 35
Upvotes: 1
Views: 1110
Reputation: 1706
I've found something that works using tapply()
.
dat %>%
group_by(task) %>%
summarize(unique_ids = length(unique(id)),
value_sum = sum(tapply(value, id, FUN = max)))
# A tibble: 3 × 3
task unique_ids value_sum
<chr> <int> <dbl>
1 A 2 30
2 B 2 30
3 C 3 35
Upvotes: 1