Reputation: 2126
I want to sample rows from a data frame using unequal sample sizes from each group.
Let's say we have a simple data frame grouped by 'group':
library(dplyr)
set.seed(123)
df <- data.frame(group = rep(c("A", "B"), each = 10),
value = rnorm(10))
df
#> group value
#> 1 A -0.56047565
#> 2 A -0.23017749
#> .....
#> 10 A -0.44566197
#> 11 B -0.56047565
#> 12 B -0.23017749
#> .....
#> 20 B -0.44566197
Using the slice_sample function from the dplyr package, you can easily slice equally sized groups from this data frame:
df %>% group_by(group) %>% slice_sample(n = 2) %>% ungroup()
#> # A tibble: 4 x 2
#> group value
#> <fct> <dbl>
#> 1 A -0.687
#> 2 A -0.446
#> 3 B -0.687
#> 4 B 1.56
Question
How do you sample a different number of values from each group (slice groups that are not equal in size)? For example, sample 4 rows from group A, and 5 rows from group B?
Upvotes: 11
Views: 3215
Reputation: 2030
Just adding an alternate answer that uses nest/unnest:
library(tidyverse)
set.seed(123)
df <- data.frame(
group = rep(c("A", "B"), each = 10),
value = rnorm(10)
)
df %>%
nest(data = value) %>%
mutate(
sample_size = c(4, 5),
data_sample = map2(data, sample_size, ~ slice_sample(.x, n = .y))
) %>%
select(group, data_sample) %>%
unnest(cols = data_sample)
#> # A tibble: 9 × 2
#> group value
#> <chr> <dbl>
#> 1 A -0.687
#> 2 A -0.446
#> 3 A -0.560
#> 4 A 0.129
#> 5 B 1.56
#> 6 B -1.27
#> 7 B -0.230
#> 8 B 0.461
#> 9 B -0.687
Created on 2022-10-28 by the reprex package (v2.0.1)
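The pipeline above assigns sample_size by position, so it relies on the nested rows coming out in the order A, B. A minimal sketch of a variant that looks the size up by group name instead, assuming a hypothetical named vector called sizes (not part of the original answer):

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
df <- data.frame(
  group = rep(c("A", "B"), each = 10),
  value = rnorm(10)
)

# Hypothetical lookup vector; the element order does not matter
# because sizes are matched by name
sizes <- c(B = 5, A = 4)

res <- df %>%
  nest(data = value) %>%
  mutate(
    # as.character() guards against 'group' being a factor
    sample_size = unname(sizes[as.character(group)]),
    data_sample = map2(data, sample_size, ~ slice_sample(.x, n = .y))
  ) %>%
  select(group, data_sample) %>%
  unnest(cols = data_sample)

# Number of sampled rows per group
count(res, group)
```

With this lookup, reordering the rows of df or the entries of sizes should not change how many rows each group contributes.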
Upvotes: 0
Reputation: 193507
You can use the stratified function from my "splitstackshape" package:
> library(splitstackshape)
> stratified(df, "group", c(A = 4, B = 5))
group value
1: A -0.6868529
2: A 0.4609162
3: A 1.7150650
4: A -0.4456620
5: B 0.4609162
6: B -0.4456620
7: B 0.1292877
8: B -1.2650612
9: B -0.2301775
Upvotes: 3
Reputation: 67778
Another data.table possibility based on a join.
Put the group-specific sample sizes in a "lookup table" (here, a list, .(...)); join with the original data on 'group' (on = .(group)); for each match in i (by = .EACHI), pick a sample from 'value' of size = size[1].
setDT(df)[.(group = c("A", "B"), size = c(4, 5)), on = .(group), sample(value, size[1]),
by = .EACHI]
# group V1
# 1: A -0.6868529
# 2: A -0.4456620
# 3: A -0.5604756
# 4: A 0.1292877
# 5: B 1.5587083
# 6: B -1.2650612
# 7: B -0.2301775
# 8: B 0.4609162
# 9: B -0.6868529
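The join above returns only the sampled values, in a column named V1. If whole rows are wanted, one option is to sample row numbers per group and subset afterwards. This is a sketch of a different technique (the .I idiom rather than a join), and the names dt, sizes, idx and sampled are hypothetical:

```r
library(data.table)

set.seed(123)
dt <- data.table(
  group = rep(c("A", "B"), each = 10),
  value = rnorm(10)
)

# Hypothetical named lookup of per-group sample sizes
sizes <- c(A = 4, B = 5)

# .I holds the row numbers (within dt) of the current group;
# sample(.N, k) draws k positions out of the group's .N rows;
# .BY$group is the length-1 value of the current group
idx <- dt[, .I[sample(.N, sizes[[as.character(.BY$group)]])], by = group]$V1

# Subset the original table by the sampled row numbers
sampled <- dt[idx]
sampled[, .N, by = group]
```

Because the subset is taken from the full table, any additional columns in dt would be carried along unchanged.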
Upvotes: 1
Reputation: 6132
set.seed(123)
library(tidyverse)
map2_df(unique(df$group), c(4,5),
~df %>%
filter(group == .x) %>%
slice_sample(n = .y))
group value
1 A -0.3724388
2 A -0.4168576
3 A 0.5629895
4 A -1.2601552
5 B 1.0527115
6 B -0.3745809
7 B 0.9769734
8 B -0.4168576
9 B -1.0491770
In case your data has not been arranged yet, you may use the following:
map2_df(unique(sort(df$group)), c(4,5),
~df %>% arrange(group) %>%
filter(group == .x) %>%
slice_sample(n = .y))
Upvotes: 1
Reputation:
The easiest thing I can think of is a map2 solution using purrr.
library(dplyr)
library(purrr)
df %>%
group_split(group) %>%
map2_dfr(c(4, 5), ~ slice_sample(.x, n = .y))
# A tibble: 9 x 2
group value
<chr> <dbl>
1 A -0.687
2 A 1.56
3 A 0.0705
4 A 1.72
5 B -0.560
6 B 0.461
7 B 0.129
8 B 0.0705
9 B -0.230
One caution: you need to know the order in which the groups are split. I think group_split() sorts the groups as if they were factors. A way around that is to adapt the code like this and look up n from a named vector.
group_slice_n <- c(A = 4, B = 5)
df %>%
split(.$group) %>%
imap_dfr(~ slice_sample(.x, n = group_slice_n[.y]))
Upvotes: 11
Reputation: 27732
A data.table approach, using mapply to loop over the list elements, with the sample sizes in a vector (which must have the same length as the list!)
library( data.table )
setDT(df) #make it a data.table
L <- split( df, by = "group" ) #split to a list by group
#function
mysamples <- function( dt, samplesize ) {
dt[ sample( seq_len(nrow(dt)), samplesize), ]
}
#mapply
mapply( mysamples, L, samplesize = c(4,5), SIMPLIFY = FALSE )
#output
# $A
# group value
# 1: A -0.6868529
# 2: A -0.4456620
# 3: A -0.5604756
# 4: A 0.1292877
#
# $B
# group value
# 1: B 1.5587083
# 2: B -1.2650612
# 3: B -0.2301775
# 4: B 0.4609162
# 5: B -0.6868529
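The mapply call returns a list with one data.table per group. If a single table is preferred, the list can be bound back together with rbindlist(). A minimal self-contained sketch, where the names dt, sampled_list and sampled are hypothetical:

```r
library(data.table)

set.seed(123)
dt <- data.table(
  group = rep(c("A", "B"), each = 10),
  value = rnorm(10)
)

# Split into a named list of data.tables, one per group
L <- split(dt, by = "group")

# Draw 'samplesize' rows at random from one data.table
mysamples <- function(dt, samplesize) {
  dt[sample(seq_len(nrow(dt)), samplesize), ]
}

# One sample size per list element, in the same order as names(L)
sampled_list <- mapply(mysamples, L, samplesize = c(4, 5), SIMPLIFY = FALSE)

# Stack the per-group samples into one data.table
sampled <- rbindlist(sampled_list)
sampled[, .N, by = group]
```

As in the answer, the size vector is matched to the list by position, so it is worth checking names(L) before relying on the order.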
Upvotes: 1
Reputation: 160407
Try this:
group_sizes <- tibble(group = c("A", "B"), size = c(4, 5))
set.seed(2021)
df %>%
left_join(group_sizes, by = "group") %>%
group_by(group) %>%
mutate(samp = sample(n())) %>%
filter(samp <= size) %>%
ungroup()
# # A tibble: 9 x 4
# group value size samp
# <chr> <dbl> <dbl> <int>
# 1 A 0.0705 4 2
# 2 A 0.129 4 4
# 3 A -0.687 4 1
# 4 A -0.446 4 3
# 5 B -0.560 5 5
# 6 B 1.56 5 1
# 7 B 0.129 5 4
# 8 B 1.72 5 3
# 9 B -1.27 5 2
Upvotes: 6