Reputation: 37
I would like to split my data proportionally. For example, I have 100 rows and I want to randomly sample one row out of every two. Using rsample from tidymodels, I assumed I would do the following:
library(rsample)
library(tibble)
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When checking the data, the split hasn't done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are then sampled from, so breaks = 50 would split the 100 rows into 50 bins, with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling one row out of every four, and so on.
Have I misunderstood the breaks argument?
Upvotes: 1
Views: 879
Reputation: 11623
You are running up against an argument that protects users from creating stratified splits from strata that are too small; it's called pool:
library(rsample)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2)))
dat
#> # A tibble: 100 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
#> 5 5 3
#> 6 6 3
#> 7 7 4
#> 8 8 4
#> 9 9 5
#> 10 10 5
#> # … with 90 more rows
split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>
training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 4 2
#> 3 5 3
#> 4 8 4
#> 5 10 5
#> 6 12 6
#> 7 13 7
#> 8 16 8
#> 9 17 9
#> 10 20 10
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 2 1
#> 2 3 2
#> 3 6 3
#> 4 7 4
#> 5 9 5
#> 6 11 6
#> 7 14 7
#> 8 15 8
#> 9 18 9
#> 10 19 10
#> # … with 40 more rows
Created on 2022-02-22 by the reprex package (v2.0.1)
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
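For the one-in-four case from the question, the same blocking idea as the strat column above can be sketched without rsample at all. This is only a minimal base R illustration, not how rsample does it internally: the names k, block, picked, sampled and remaining are made up for the example, and the seed is arbitrary.

```r
# Sketch: draw exactly one row from each consecutive block of k rows (base R only).
set.seed(123)                      # arbitrary seed, only for reproducibility
dat <- data.frame(value = 1:100)
k <- 4                             # block size: one row sampled out of every k

# Block id for each row: rows 1-4 -> 0, rows 5-8 -> 1, and so on
block <- (seq_len(nrow(dat)) - 1) %/% k

# Pick one row index at random inside each block
# (sample.int on the block length avoids the sample(x) surprise for length-1 x)
picked <- vapply(
  split(seq_len(nrow(dat)), block),
  function(i) i[sample.int(length(i), 1)],
  integer(1)
)

sampled   <- dat[picked, , drop = FALSE]   # one row per block
remaining <- dat[-picked, , drop = FALSE]  # the other k - 1 rows of each block
```

Setting k <- 2 gives the one-in-two case. Because each block contributes exactly one row, this does not depend on a pooling threshold, but it also gives you plain data frames rather than an rsplit object.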
Upvotes: 2