Steven M. Mortimer
Steven M. Mortimer

Reputation: 1706

R: Sum the Max Values of Unique Rows with dplyr

I am trying to come up with a sum for each task in a dataset that only uses the largest value observed for the id once in the sum. If that's not clear I've provided an example of the desired output below.

Sample Data

dat <- data.frame(task = rep(LETTERS[1:3], each=3), 
                  id = c(rep(1:2, 4) , 3),
                  value = c(rep(c(10,20), 4), 5))
dat
  task id value
1    A  1    10
2    A  2    20
3    A  1    10
4    B  2    20
5    B  1    10
6    B  2    20
7    C  1    10
8    C  2    20
9    C  3     5

I've found an answer that works, but it requires two separate group_by() functions. Is there a way to get the same output with a single group_by()? The reason is I have other summarized metrics that are sensitive to the grouping and I can't run two different group_by functions in the same pipeline.

dat %>% 
  group_by(task, id) %>%
  summarize(v = max(value)) %>%
  group_by(task) %>%
  summarize(unique_ids = n_distinct(id), 
            value_sum = sum(v))

# A tibble: 3 × 3
   task unique_ids value_sum
  <chr>      <int>     <dbl>
1     A          2        30
2     B          2        30
3     C          3        35

Upvotes: 1

Views: 1110

Answers (1)

Steven M. Mortimer
Steven M. Mortimer

Reputation: 1706

I've found something that works using tapply().

dat %>% 
  group_by(task) %>%
  summarize(unique_ids = length(unique(id)), 
            value_sum = sum(tapply(value, id, FUN = max)))

# A tibble: 3 × 3
   task unique_ids value_sum
  <chr>      <int>     <dbl>
1     A          2        30
2     B          2        30
3     C          3        35

Upvotes: 1

Related Questions