max

Reputation: 4521

Top "n" rows of each group using dplyr -- with different number per group

I'll use the built-in chickwts data as an example.

Here's the data; there are 6 feed types.

> head(chickwts)

  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean

> table(chickwts$feed)

   casein horsebean   linseed  meatmeal   soybean sunflower 
       12        10        12        11        14        12 

What I want is the top rows by weight for every feed type. However, I need a different number of rows for each feed type. For example,

top_n_feed <-
  c(
    "casein" = 3,
    "horsebean" = 5,
    "linseed" = 3,
    "meatmeal" = 6,
    "soybean" = 3,
    "sunflower" = 2
  )

How can I do this using dplyr?

To get the top n rows of each feed type by weight I can use the code below, but I'm not sure how to extend it to a different number for each feed type.

chickwts %>%
  group_by(feed) %>% 
  slice_max(order_by = weight, n = 5)

Upvotes: 4

Views: 1245

Answers (4)

mt1022

Reputation: 17289

Another way using split and map2:

library(dplyr)
library(purrr)

chickwts %>%
  filter(feed %in% names(top_n_feed)) %>%
  split(.$feed) %>%
  map2_dfr(top_n_feed[names(.)], ~ slice_max(.x, order_by = weight, n = .y))
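
The same per-group idea can also be sketched without split, using dplyr's group_modify, which hands each group's rows (.x) and its key (.y) to a function. This is only a sketch: it assumes every feed level in chickwts has an entry in top_n_feed (a missing name would raise an error), and the name result is just illustrative.

```r
library(dplyr)

top_n_feed <- c(casein = 3, horsebean = 5, linseed = 3,
                meatmeal = 6, soybean = 3, sunflower = 2)

result <- chickwts %>%
  group_by(feed) %>%
  # .x holds the rows of one group, .y is a one-row tibble with its key;
  # look up that group's n by name (assumes the name exists in top_n_feed)
  group_modify(~ slice_max(.x, order_by = weight,
                           n = top_n_feed[[as.character(.y$feed)]])) %>%
  ungroup()

# Rows kept per feed, for comparison with top_n_feed
count(result, feed)
```

Note that slice_max keeps ties by default, so a group could return more than n rows if weights tie at the cutoff; with_ties = FALSE avoids that.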

Upvotes: 1

Ewen

Reputation: 1381

Any time you have a named list or vector, think purrr::imap. Avoid joins if they aren't required, particularly when working at scale.

library(dplyr)
library(purrr)

top_n_feed <- c(
    "casein" = 3,
    "horsebean" = 5,
    "linseed" = 3,
    "meatmeal" = 6,
    "soybean" = 3,
    "sunflower" = 2
  )

imap_dfr(top_n_feed, ~ filter(chickwts, feed %in% .y) %>% 
           slice_max(order_by = weight, n = .x))

   weight      feed
1     404    casein
2     390    casein
3     379    casein
4     227 horsebean
5     217 horsebean
6     179 horsebean
7     168 horsebean
8     160 horsebean
9     309   linseed
10    271   linseed
11    260   linseed
12    380  meatmeal
13    344  meatmeal
14    325  meatmeal
15    315  meatmeal
16    303  meatmeal
17    263  meatmeal
18    329   soybean
19    327   soybean
20    316   soybean
21    423 sunflower
22    392 sunflower

Upvotes: 2

Ronak Shah

Reputation: 388807

Join top_n_feed onto the chickwts data frame, then select the top n rows for each group.

library(dplyr)

tibble::enframe(top_n_feed, name = 'feed') %>% 
        left_join(chickwts, by = 'feed') %>%
        group_by(feed) %>%
        top_n(first(value), weight)

#   feed      value weight
#   <chr>     <dbl>  <dbl>
# 1 casein        3    390
# 2 casein        3    379
# 3 casein        3    404
# 4 horsebean     5    179
# 5 horsebean     5    160
# 6 horsebean     5    227
# 7 horsebean     5    217
# 8 horsebean     5    168
# 9 linseed       3    309
#10 linseed       3    260
# … with 12 more rows

For some reason I was not able to make slice_sample work for this example.
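
One possible explanation, offered as an assumption because it depends on the dplyr version: the slice_*() helpers expect n to be a single constant rather than an expression evaluated per group. A sketch that sidesteps this is to filter on the rank directly, which is what top_n(n, wt) amounts to for positive n. The names result and value here are illustrative, and feed is converted to character up front so both sides of the join have the same type.

```r
library(dplyr)

top_n_feed <- c(casein = 3, horsebean = 5, linseed = 3,
                meatmeal = 6, soybean = 3, sunflower = 2)

result <- tibble::enframe(top_n_feed, name = "feed") %>%
  # match column types before joining (chickwts$feed is a factor)
  left_join(mutate(chickwts, feed = as.character(feed)), by = "feed") %>%
  group_by(feed) %>%
  # keep rows whose descending rank is within this group's own n
  filter(min_rank(desc(weight)) <= first(value)) %>%
  ungroup() %>%
  select(-value)

count(result, feed)
```

Like top_n, this keeps ties at the cutoff, so a group may return more than n rows when weights are tied.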

Upvotes: 1

MrFlick

Reputation: 206167

This isn't really something that dplyr makes easy. I'd recommend merging in the data and then filtering.


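library(dplyr)

# Turn the named vector into a lookup table, join it onto the data,
# then keep the first `topn` rows of each feed after sorting by weight.
tibble(feed = names(top_n_feed), topn = top_n_feed) %>% 
  inner_join(chickwts, by = "feed") %>% 
  group_by(feed) %>% 
  arrange(desc(weight), .by_group = TRUE) %>% 
  filter(row_number() <= topn) %>%
  select(-topn)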

Upvotes: 6
