B.M Njuguna

Reputation: 109

slice_sample producing different samples in grouped .data

Why do the following two grouping methods produce different samples? My assumption was that both ways of grouping would give the same sample.

library(dplyr)

small <- data.frame(
  id = 1:100,
  # the two labels are recycled to 100 rows: male, female, male, ...
  gender = rep(c('male', 'female'))
)

set.seed(123)
small |> 
  group_by(gender) |> 
  slice_sample(n = 10, replace = F)

set.seed(123)
small |> 
  slice_sample(n = 10, replace = F, by = gender)

Upvotes: 4

Views: 168

Answers (1)

NicChr

Reputation: 1253

Basically, when you use by (or .by), the groups are kept in order of first appearance, whereas group_by() sorts the groups by key. Since 'male' appears before 'female' in the data but sorts after it, the two approaches process the groups in opposite orders, which explains the difference in the results.
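To see the group order directly, here is a minimal sketch using the same data as the question. It uses summarise() rather than slice_sample() so that no randomness is involved; the names keys_group_by and keys_by are only illustrative.

```r
library(dplyr)

small <- data.frame(
  id = 1:100,
  gender = rep(c("male", "female"), times = 50)
)

# group_by() sorts the group keys, so "female" comes first
keys_group_by <- small |>
  group_by(gender) |>
  summarise(n = n()) |>
  pull(gender)

# .by keeps the groups in order of first appearance, so "male" comes first
keys_by <- small |>
  summarise(n = n(), .by = gender) |>
  pull(gender)

keys_group_by
#> [1] "female" "male"
keys_by
#> [1] "male"   "female"
```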

My package timeplyr actually has arguments to control this behaviour.

Edit: You can also control this behaviour through fgroup_by(order =)

As to why the actual samples are different within each group, my best guess is that, even though the seed is the same, the sampling is done in a different by-group order, and this affects which samples are drawn.
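The guess about by-group order can be illustrated with base R alone. This is only a sketch of the idea, not dplyr's actual implementation: draw_in_order() is a hypothetical helper that draws 10 positions out of 50 for each group, one group after the other, from a single seeded RNG stream.

```r
# Hypothetical helper: sample 10 of 50 row positions per group, in the given order
draw_in_order <- function(group_order, seed = 123) {
  set.seed(seed)
  # each sample.int() call consumes the next part of the same RNG stream
  draws <- lapply(group_order, function(g) sample.int(50, 10))
  names(draws) <- group_order
  draws
}

sorted_order     <- draw_in_order(c("female", "male"))  # like group_by()
appearance_order <- draw_in_order(c("male", "female"))  # like by / .by

# Whichever group is processed first gets the same within-group positions
identical(sorted_order$female, appearance_order$male)
#> [1] TRUE
identical(sorted_order$male, appearance_order$female)
#> [1] TRUE
```

Under this assumption, the first set of positions goes to 'female' in one case and to 'male' in the other, so each group ends up with different rows even though the seed is identical.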

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

set.seed(123)
res1 <- small |> 
  group_by(gender) |> 
  slice_sample(n = 10, replace = F)

set.seed(123)
res2 <- small |> 
  slice_sample(n = 10, replace = F, by = gender)

library(timeplyr)
#> 
#> Attaching package: 'timeplyr'
#> The following object is masked from 'package:dplyr':
#> 
#>     desc

res3 <- small |> 
  fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = TRUE)
res4 <- small |> 
  fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = FALSE)

identical(as.data.frame(res1), res3)
#> [1] TRUE
identical(as.data.frame(res2), res4)
#> [1] TRUE

res5 <- small |> 
  fgroup_by(gender, order = TRUE) |> 
  fslice_sample(n = 10, replace = F, seed = 123)
res6 <- small |> 
  fgroup_by(gender, order = FALSE) |> 
  fslice_sample(n = 10, replace = F, seed = 123)

identical(as.data.frame(res1), as.data.frame(res5))
#> [1] TRUE
identical(as.data.frame(res2), as.data.frame(res6))
#> [1] TRUE

Created on 2024-08-01 with reprex v2.0.2

Upvotes: 1
