R: Sum the Max Values of Unique Rows with dplyr

Question

I am trying to come up with a sum for each task in a dataset that only uses the largest value observed for the id once in the sum. If that's not clear I've provided an example of the desired output below.

Sample Data

dat <- data.frame(task = rep(LETTERS[1:3], each=3), 
                  id = c(rep(1:2, 4) , 3),
                  value = c(rep(c(10,20), 4), 5))
dat
  task id value
1    A  1    10
2    A  2    20
3    A  1    10
4    B  2    20
5    B  1    10
6    B  2    20
7    C  1    10
8    C  2    20
9    C  3     5

I've found an answer that works, but it requires two separate group_by() functions. Is there a way to get the same output with a single group_by()? The reason is I have other summarized metrics that are sensitive to the grouping and I can't run two different group_by functions in the same pipeline.

dat %>% 
  group_by(task, id) %>%
  summarize(v = max(value)) %>%
  group_by(task) %>%
  summarize(unique_ids = n_distinct(id), 
            value_sum = sum(v))

# A tibble: 3 × 3
   task unique_ids value_sum
             
1     A          2        30
2     B          2        30
3     C          3        35

Steven M. Mortimer · Accepted Answer

I've found something that works using tapply().

dat %>% 
  group_by(task) %>%
  summarize(unique_ids = length(unique(id)), 
            value_sum = sum(tapply(value, id, FUN = max)))

# A tibble: 3 × 3
   task unique_ids value_sum
             
1     A          2        30
2     B          2        30
3     C          3        35

R: Sum the Max Values of Unique Rows with dplyr

Answers (1)

Related Questions