Kryštof Chytrý

Reputation: 347

Proportional selection of n rows per group

I have the following table on the abundance of two species in groups:

library(tidyverse)

df <- tribble(~name, ~id, ~freq, ~toselect,
"spA",  22, 10,  4,
"spA",  23, 10,  4,
"spA",  21, 8,  4,
"spA",  19, 6,  4,
"spA",  25, 5,  4,
"spA",  26, 4,  4,
"spA",  27, 4,  4,
"spA",  28, 3,  4,
"spA",  29, 3,  4,
"spA",  24, 2,  4,
"spA",  30, 2,  4,
"spA",  20, 1,  4,
"spA",  31, 1,  4,
"spA",  33, 1,  4,

"spB",  27, 9,  2,
"spB",  28, 1,  2,
"spB",  29, 1,  2,
"spB",  24, 1,  2,
"spB",  30, 1,  2,
"spB",  20, 1,  2,
"spB",  31, 1,  2,
"spB",  33, 1,  2) 

I want to select n rows per species, where n is a species-specific parameter stored in the tibble (column "toselect"). However, I want the selection to be proportional to the frequency of the species in each group (column "freq"), so duplicates are fine and even wanted (e.g. for spB I actually want the algorithm to select group 27 twice).

I actually faced two issues. The traditional sample_n() works well for selecting the desired number of rows per group.

df %>% group_by(name) %>% 
    sample_n(toselect[1], replace = TRUE)

The other option I thought of is its successor, slice_sample(). This is a cool function and works well with duplicates. However, it does not accept a different number of rows for each group.

df %>% group_by(name) %>% 
    slice_sample(n = 4, replace = TRUE) # instead of 4, I would like to use toselect[1] here

Lastly, neither of these two options works for proportional selection. I tried adding the argument weight = freq, but this still produces a random selection. Therefore I ask: is there a way to do it?

Upvotes: 0

Views: 127

Answers (1)

Dan Chaltiel

Reputation: 8484

Unfortunately, the n argument of slice_sample() and sample_n() is not vectorized.

Therefore, you have to use a loop-like function to achieve this.

Here, I use a combination of dplyr::group_split() and purrr::map_dfr():

library(tidyverse)

set.seed(0)
df %>% 
  group_split(name) %>% 
  map_dfr(~{
    # toselect is constant within a group, so its first value is that group's n
    sample_n(.x, toselect[1], replace = TRUE)
  })
#> # A tibble: 6 x 4
#>   name     id  freq toselect
#>   <chr> <dbl> <dbl>    <dbl>
#> 1 spA      33     1        4
#> 2 spA      29     3        4
#> 3 spA      19     6        4
#> 4 spA      27     4        4
#> 5 spB      27     9        2
#> 6 spB      28     1        2

Created on 2021-05-15 by the reprex package (v2.0.0)
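
The snippet above draws rows uniformly within each species. If the draw should also follow freq, as the question asks, one possible variation is to pass the weights to slice_sample() through its weight_by argument (the equivalent argument in sample_n() is weight). This is only a sketch: the data below is a shortened stand-in for the question's table, and picked and expected are illustrative names. A weighted draw is still random. The weights only make high-frequency groups more likely to be picked, so a single small sample can look arbitrary even when the weighting is applied.

```r
library(dplyr)
library(purrr)

# Shortened stand-in for the question's table (same columns, fewer rows)
df <- tibble::tribble(
  ~name, ~id, ~freq, ~toselect,
  "spA",  22,    10,         4,
  "spA",  23,    10,         4,
  "spA",  21,     8,         4,
  "spA",  20,     1,         4,
  "spB",  27,     9,         2,
  "spB",  28,     1,         2,
  "spB",  29,     1,         2
)

set.seed(0)
picked <- df %>%
  group_split(name) %>%
  # n is taken from each group's own toselect value;
  # weight_by = freq makes rows with a higher freq more likely to be drawn
  map(~ slice_sample(.x, n = .x$toselect[1], weight_by = freq, replace = TRUE)) %>%
  bind_rows()

# Expected number of picks per row under weighted sampling with replacement,
# useful for judging whether repeated draws follow freq
expected <- df %>%
  group_by(name) %>%
  mutate(expected = toselect * freq / sum(freq)) %>%
  ungroup()
```

Here picked has toselect rows per species, and the expected column shows how many times each row would be drawn on average. For spB in this shortened table, group 27 has an expected count of 2 * 9 / 11, about 1.64, so it will often, but not always, be selected twice.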

Upvotes: 1
