JustMe

Reputation: 11

R: Representative random sampling for 150 values from categories with different group size

I want to draw 150 random samples from a dataset, stratified by two categories, "site" and "species". Ideally, the result has 30 samples per site, with the species within each site represented more or less equally.

Reproducible example:

df <- data.frame(site = rep(c("A", "B", "C", "D", "E"), each = 10), species = c("s1", rep("s2", each = 3), rep("s3", each = 16), rep("s4", each = 13), rep("s5", each = 17)), individual = c(1, 1:3, 1:16, 1:13, 1:17) )

I think the dplyr functions group_by(site, species) and slice_sample() are a good approach, but they sample a fixed amount per group rather than 150 in total. Another problem is that slice_sample() needs at least n rows in each group to work, which is not always the case. So, is there a way to sample 150 in total and, whenever a group cannot supply the desired amount, sample from other groups to compensate?

Thanks!

Upvotes: 1

Views: 217

Answers (1)

TimTeaFan

Reputation: 18581

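One option is to nest_by(site) and then use slice_sample() to draw a sample of 30 from each site. Because each site in the example data has only 10 rows, the sample has to be drawn with replacement (replace = TRUE). If needed, we can use tidyr::unnest() to get one "normal" data.frame containing all samples drawn.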

The problem is probably the condition that:

where each species is more or less equally distributed

When we look at your sites, we can see that most of them contain only one species. So drawing samples from your original data will lead to some sites containing only a single species. Alternatively, we could sample species only and assign each sampled row a site at random, ignoring the fact that the species may never have been observed there.
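To see why equal species representation per site is not achievable with this data, it may help to tabulate the site/species combinations first. This is only a quick check on the example df from the question; the name species_per_site is made up for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16),
              rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

# Number of rows for every site/species combination that occurs in the data
species_per_site <- count(df, site, species)
species_per_site
# Counts: A has s1 = 1, s2 = 3, s3 = 6; B has s3 = 10; C has s4 = 10;
#         D has s4 = 3, s5 = 7; E has s5 = 10
```

Sites B, C and E each hold a single species, so any sample drawn within those sites can only contain that species.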

library(dplyr)
library(tidyr)

site_sample <- df %>% 
  nest_by(site) %>% 
  summarise(data = list(slice_sample(data, n = 30, replace = TRUE)))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups`
#> argument.

site_sample
#> # A tibble: 5 x 2
#> # Groups:   site [5]
#>   site  data             
#>   <chr> <list>           
#> 1 A     <tibble [30 x 2]>
#> 2 B     <tibble [30 x 2]>
#> 3 C     <tibble [30 x 2]>
#> 4 D     <tibble [30 x 2]>
#> 5 E     <tibble [30 x 2]>

site_sample %>% 
  unnest(data)
#> # A tibble: 150 x 3
#> # Groups:   site [5]
#>    site  species individual
#>    <chr> <chr>        <dbl>
#>  1 A     s1               1
#>  2 A     s3               1
#>  3 A     s1               1
#>  4 A     s3               5
#>  5 A     s3               3
#>  6 A     s3               4
#>  7 A     s2               2
#>  8 A     s3               3
#>  9 A     s3               5
#> 10 A     s3               2
#> # ... with 140 more rows

original data

df <- data.frame(site = rep(c("A", "B", "C", "D", "E"), each = 10), species = c("s1", rep("s2", each = 3), rep("s3", each = 16), rep("s4", each = 13), rep("s5", each = 17)), individual = c(1, 1:3, 1:16, 1:13, 1:17) ) 

Created on 2022-12-16 by the reprex package (v2.0.1)
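If sampling without replacement is preferred, the "compensation" idea from the question could be sketched roughly as follows: take up to a fixed number of rows from every site/species group, then top up from the rows not yet drawn until the total is reached. This is only a sketch under assumptions, not a tested solution. The helper sample_with_topup() and its arguments total and per_group are hypothetical names, and because the example data has only 50 rows, the demo uses total = 30 rather than 150.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16),
              rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

# Hypothetical helper: draws `total` rows without replacement, taking at most
# `per_group` rows from each site/species group and filling any shortfall
# from the rows that were not drawn in the first step.
sample_with_topup <- function(data, total, per_group) {
  data <- mutate(data, .row = row_number())

  # Step 1: give each row a random rank within its group and keep the first
  # `per_group`; groups smaller than `per_group` contribute all their rows
  base <- data %>%
    group_by(site, species) %>%
    mutate(.rand = sample.int(n())) %>%
    filter(.rand <= per_group) %>%
    ungroup()

  # If step 1 already exceeds the target, trim it back at random
  if (nrow(base) > total) base <- slice_sample(base, n = total)

  # Step 2: top up from the remaining rows, as far as they are available
  rest <- anti_join(data, base, by = ".row")
  shortfall <- min(total - nrow(base), nrow(rest))
  extra <- slice_sample(rest, n = shortfall)

  bind_rows(base, extra) %>% select(-.row, -.rand)
}

set.seed(1)
out <- sample_with_topup(df, total = 30, per_group = 3)
nrow(out)
```

The top-up step samples the leftover rows uniformly, so larger groups are more likely to fill the gap. Whether that counts as "representative" depends on the goal; weighting the top-up differently would be a separate design choice.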

Upvotes: 1
