Reputation: 11
I want to draw 150 random samples from a dataset, stratified by two categories, "site" and "species". Ideally, the result has 30 samples per site, with the species within each site represented more or less equally.
Reproducible example:
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16), rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)
I think dplyr's group_by(site, species) followed by slice_sample() is a good starting point, but it samples a fixed number per group rather than 150 in total. Another problem is that slice_sample() needs at least n rows in each group to work, which is not always the case. So, is there a way to sample 150 in total and, whenever a group cannot supply the desired number, draw from other groups to compensate?
Thanks!
Upvotes: 1
Views: 217
Reputation: 18581
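One option is to nest_by(site) and then use slice_sample() to draw a sample of 30 from each group. Because each site in the example has only 10 rows, the draw has to be made with replacement (replace = TRUE). If needed, we can use tidyr::unnest() to get one "normal" data.frame containing all samples drawn.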
The problem is probably the condition:
"where each species is more or less equally distributed"
When we look at your sites, we can see that most of them contain only one species (B, C and E in the example data). Drawing samples from your original data will therefore produce sites that contain only a single species. Alternatively, we could just sample species and assign each draw a site at random, regardless of whether that species has ever been observed there.
library(dplyr)
library(tidyr)
site_sample <- df %>%
  nest_by(site) %>%
  summarise(data = list(slice_sample(data, n = 30, replace = TRUE)))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups`
#> argument.
site_sample
#> # A tibble: 5 x 2
#> # Groups:   site [5]
#>   site  data
#>   <chr> <list>
#> 1 A     <tibble [30 x 2]>
#> 2 B     <tibble [30 x 2]>
#> 3 C     <tibble [30 x 2]>
#> 4 D     <tibble [30 x 2]>
#> 5 E     <tibble [30 x 2]>
site_sample %>%
  unnest(data)
#> # A tibble: 150 x 3
#> # Groups:   site [5]
#>    site  species individual
#>    <chr> <chr>        <dbl>
#>  1 A     s1               1
#>  2 A     s3               1
#>  3 A     s1               1
#>  4 A     s3               5
#>  5 A     s3               3
#>  6 A     s3               4
#>  7 A     s2               2
#>  8 A     s3               3
#>  9 A     s3               5
#> 10 A     s3               2
#> # ... with 140 more rows
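If the species within a site should come out roughly balanced wherever the data allow it, one possible variation is to weight each row by the inverse of its species count within the site, so that every species present at a site has the same chance of being drawn. This is only a sketch: the helper columns n_rows and w, the object name balanced_sample, and the seed are illustrative choices, and sampling still happens with replacement because each site has only 10 rows.

```r
library(dplyr)

df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", 3), rep("s3", 16), rep("s4", 13), rep("s5", 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

balanced_sample <- df %>%
  # number of rows for each site/species combination
  add_count(site, species, name = "n_rows") %>%
  # rows of rarer species get larger weights; weights per species sum to 1
  mutate(w = 1 / n_rows) %>%
  group_by(site) %>%
  # 30 draws per site, with replacement, each species equally likely
  slice_sample(n = 30, weight_by = w, replace = TRUE) %>%
  ungroup() %>%
  select(-n_rows, -w)

# inspect how the draws are spread over species within each site
count(balanced_sample, site, species)
```

Sites that contain a single species (B, C and E) will still yield only that species, and a species with a single row, such as s1 at site A, will simply be drawn repeatedly.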
Original data:
df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", each = 3), rep("s3", each = 16), rep("s4", each = 13), rep("s5", each = 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)
Created on 2022-12-16 by the reprex package (v2.0.1)
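The alternative mentioned earlier, sampling species and assigning a site at random, could be sketched roughly as follows. The choices here are assumptions rather than requirements from the question: 30 draws per species (5 species x 30 = 150), an even split of each species across the five sites, and the names sites and random_site_sample. Note that the observed site of each individual is overwritten, so the result no longer reflects where a species was actually found.

```r
library(dplyr)

df <- data.frame(
  site = rep(c("A", "B", "C", "D", "E"), each = 10),
  species = c("s1", rep("s2", 3), rep("s3", 16), rep("s4", 13), rep("s5", 17)),
  individual = c(1, 1:3, 1:16, 1:13, 1:17)
)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
sites <- unique(df$site)

random_site_sample <- df %>%
  group_by(species) %>%
  # 30 draws per species; replacement is needed because s1 has only one row
  slice_sample(n = 30, replace = TRUE) %>%
  # overwrite the observed site: each site appears 6 times per species, in random order
  mutate(site = sample(rep(sites, length.out = n()))) %>%
  ungroup()

# every site/species combination should now hold 6 rows
count(random_site_sample, site, species)
```

This gives exactly 30 rows per site with all five species equally represented, at the cost of discarding the real site information.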
Upvotes: 1