tmfmnk
tmfmnk

Reputation: 39858

Efficient recursive random sampling with groups of unequal size

This question is a follow-up to my previous question on recursive random sampling Efficient recursive random sampling. The solutions in that thread work fine when the groups are of identical size or when a fixed number of samples per group is required. However, let's imagine a dataset as follows;

   ID1 ID2
1    A   1
2    A   6
3    B   1
4    B   2
5    B   3
6    C   4
7    C   5
8    C   6
9    D   6
10   D   7
11   D   8
12   D   9

where we want to randomly sample up to n ID2 for each ID1, and doing so recursively. Recursively here means that we are moving from the first ID1 to the last ID1, and if an ID2 was already sampled for an ID1, then it should not be used for a subsequent ID1. Let's say n = 2, then expected results would be as follows;

ID1 ID2
1    A   1
2    A   6
4    B   2
5    B   3
6    C   4
7    C   5
11   D   8
12   D   9

What can happen beyond the situation shown in the example;

Sample df;

df <- structure(list(ID1 = c("A", "A", "B", "B", "B", "C", "C", "C", 
"D", "D", "D", "D"), ID2 = c(1, 6, 1, 2, 3, 4, 5, 6, 6, 7, 8, 
9)), class = "data.frame", row.names = c(NA, -12L))

Upvotes: 6

Views: 412

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 101189

Here is a base R option using dynamic programming (DP)

d <- table(df)
nms <- dimnames(d)
res <- list()
for (i in nms$ID1) {
  idx <- which(d[i, ] > 0)
  if (length(idx) >= 2) {
    j <- sample(idx, 2)
    res[[i]] <- nms$ID2[j]
    d[, j] <- 0
  }
}
dfout <- type.convert(
  setNames(rev(stack(res)), names(df)),
  as.is = TRUE
)

which gives

  ID1 ID2
1   A   6
2   A   1
3   B   2
4   B   3
5   C   4
6   C   5
7   D   7
8   D   8

For the case with used ID2 already, e.g.,

> (df <- structure(list(ID1 = c(
+   "A", "A", "B", "B", "B", "C", "C", "C",
+   "D", "D", "D", "D"
+ ), ID2 = c(
+   1, 3, 1, 2, 3, 3, 4, 5, 4, 5, 6, .... [TRUNCATED]
   ID1 ID2
1    A   1
2    A   3
3    B   1
4    B   2
5    B   3
6    C   3
7    C   4
8    C   5
9    D   4
10   D   5
11   D   6
12   D   1

we will obtain

  ID1 ID2
1   A   1
2   A   3
3   C   5
4   C   4

Upvotes: 1

Onyambu
Onyambu

Reputation: 79208

I dont know whether I am oversimplifying the problem. Take a look at the following and see whether it works in your case:

library(tidyverse)


df %>%
  group_split(ID1)%>%
  reduce(~ bind_rows(.x, .y) %>%
           filter(!duplicated(ID2))%>%
           group_by(ID1)%>%
           slice_sample(n=2) %>%
           ungroup, 
         .init = slice_sample(.[[1]], n=2))

# A tibble: 8 x 2
  ID1     ID2
  <chr> <dbl>
1 A         1
2 A         6
3 B         2
4 B         3
5 C         4
6 C         5
7 D         9
8 D         8

Disclaimer: NOt vectorized, thus inefficient

Upvotes: 0

cazman
cazman

Reputation: 1492

The following function seems to give what you are after. Basically, it loops through each group of ID1 and selects the rows where the corresponding ID2 has not been sampled. Then it selects the distinct rows (in the case that some group of ID1 has duplicate ID2 values. The sample size will be the minimum of either n, or the number of rows for that group.

sample <- function(df, n) {
  `%notin%` <- Negate(`%in%`)
  groups <- unique(df$ID1)
  out <- data.frame(ID1 = character(), ID2 = character())
  
  for (group in groups) {
    options <- df %>%
      filter(ID1 == group,
             ID2 %notin% out$ID2)
    
    chosen <- sample_n(options,
                       size = min(n, nrow(options))) %>%
      distinct()
    
    out <- rbind(out, chosen)
  }
  
  out
}

set.seed(123)
sample(df, 2)

  ID1 ID2
1   A   1
2   A   6
3   B   2
4   B   3
5   C   4
6   C   5
7   D   8
8   D   9

Case where a group of ID1 has ID2s that were already used up: Input:

# A tibble: 10 × 2
   ID1     ID2
   <chr> <dbl>
 1 A         1
 2 A         3
 3 B         1
 4 B         3
 5 C         5
 6 C         6
 7 C         7
 8 C         7
 9 D        10
10 D        20

Output:

sample(df2, 2)
# A tibble: 6 × 2
  ID1     ID2
  <chr> <dbl>
1 A         3
2 A         1
3 C         6
4 C         7
5 D        20
6 D        10

Upvotes: 1

Related Questions