nicholas
nicholas

Reputation: 983

How to sum n highest values by row using dplyr without reshaping?

I would like to create a new column based on the n highest values per row of a data frame.

Take the following example:

library(tibble)
df <- tribble(~name, ~q_1, ~q_2, ~q_3, ~sum_top_2,
              "a", 4, 1, 5, 9,
              "b", 2, 8, 9, 17)

Here, the sum_top_2 column sums the 2 highest values of columns prefixed with "q_". I would like to generalize to the n highest values by row. How can I do this using dplyr without reshaping?

Upvotes: 6

Views: 343

Answers (1)

akrun
akrun

Reputation: 887078

One option is pmap from purrr to loop over the rows of the columns that starts_with 'q_', by sorting the row in decreasing order, get the first 'n' sorted elements with head and sum

library(dplyr)
library(purrr)
library(stringr)
n <- 2
df %>% 
   mutate(!! str_c("sum_top_", n) := pmap_dbl(select(cur_data(), 
           starts_with('q_')), 
            ~ sum(head(sort(c(...), decreasing = TRUE), n))))

-output

# A tibble: 2 x 5
  name    q_1   q_2   q_3 sum_top_2
  <chr> <dbl> <dbl> <dbl>     <dbl>
1 a         4     1     5         9
2 b         2     8     9        17

Or use rowwise from dplyr.

df %>% 
   rowwise %>% 
   mutate(!! str_c("sum_top_", n) := sum(head(sort(c_across(starts_with("q_")), 
           decreasing = TRUE), n))) %>% 
   ungroup
# A tibble: 2 x 5
  name    q_1   q_2   q_3 sum_top_2
  <chr> <dbl> <dbl> <dbl>     <dbl>
1 a         4     1     5         9
2 b         2     8     9        17

Upvotes: 4

Related Questions