S_Gill

Reputation: 37

How to proportionally split data using initial_split in R

I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row out of every two rows. Using rsample from tidymodels, I assumed I would do the following:

dat <- as_tibble(seq(1:100))

split <- initial_split(dat, prop = 0.5, breaks = 50)

testing <- testing(split)

When I check the data, the split hasn't done what I expected. It is close, but not exact. I thought the breaks argument generates bins that are then sampled from, so breaks = 50 would split the 100 rows into 50 bins of two rows each. I have also tried strata = value to stratify across the rows, but I cannot get that to work either.

I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, and so on.

Have I misunderstood the breaks argument?

Upvotes: 1

Views: 879

Answers (1)

Julia Silge

Reputation: 11623

You are running up against an argument called pool, which protects users from creating stratified splits in which the strata are too small:

library(rsample)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2))) 
dat
#> # A tibble: 100 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     2 1    
#>  3     3 2    
#>  4     4 2    
#>  5     5 3    
#>  6     6 3    
#>  7     7 4    
#>  8     8 4    
#>  9     9 5    
#> 10    10 5    
#> # … with 90 more rows

split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>

training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     1 1    
#>  2     4 2    
#>  3     5 3    
#>  4     8 4    
#>  5    10 5    
#>  6    12 6    
#>  7    13 7    
#>  8    16 8    
#>  9    17 9    
#> 10    20 10   
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#>    value strat
#>    <int> <fct>
#>  1     2 1    
#>  2     3 2    
#>  3     6 3    
#>  4     7 4    
#>  5     9 5    
#>  6    11 6    
#>  7    14 7    
#>  8    15 8    
#>  9    18 9    
#> 10    19 10   
#> # … with 40 more rows

Created on 2022-02-22 by the reprex package (v2.0.1)

We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
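For the "1 row out of every four" case raised in the question, the same idea can be sketched as follows. This is only an illustration and not part of the original reprex: the names dat4 and split4, the blocks of four consecutive rows, and the seed are all assumptions. With prop = 0.75, three rows per block should go to training and the remaining one to testing, and the same warning about pool is expected.

```r
library(rsample)
library(dplyr)

# Hypothetical variation on the example above: 25 blocks of four consecutive rows.
dat4 <- tibble(value = 1:100, strat = as.factor(rep(1:25, each = 4)))

set.seed(123)  # arbitrary seed, used only so the sketch is reproducible

# prop = 0.75 asks for 3 of the 4 rows in each block to go to training.
# pool = 0 disables pooling of small strata, as in the example above.
split4 <- initial_split(dat4, prop = 0.75, strata = strat, pool = 0)

# Count how many testing rows each block contributed.
testing(split4) %>% count(strat)
```

Counting rows per block like this is a quick way to check that the split behaved as intended before relying on it.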

Upvotes: 2
