RobinLovelace

Reputation: 4997

How to rapidly sample from groups in R

I have a large dataset, x, that contains replicated values, some of which are duplicated across its variables:

set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_unique <- x[!duplicated(x),]

I need to sample all instances of each unique row in x a given number of times, so I create a new variable that is simply a concatenation of the variables for each row:

# Another way of seeing each row of x: as a single value - will be useful later
x_code <- do.call(paste0, x)
u_code <- x_code[!duplicated(x)]

We need a repeated sample from x, replicating each unique row s times. This information is provided in the vector s:

s <- rpois(n = nrow(x_unique), lambda = 0.9)

The question is: how to sample individuals from x to reach the quota set by s, for each unique row? Here's a long and inelegant way that gets the right result:

sel <- integer(0)  # must be initialised before the loop, or c(sel, ...) fails
for (i in seq_along(s)) {
  xs <- which(x_code %in% u_code[i])
  sel <- c(sel, xs[sample(length(xs), size = s[i], replace = TRUE)])
}

x_sampled <- x[sel, ]

This is slow to run and cumbersome to write.

Is there a way to generate the same result (x_sampled in the above) faster and more concisely? Surely there must be a way!

Upvotes: 2

Views: 190

Answers (2)

hadley

Reputation: 103948

The key to doing this efficiently is to figure out how to work with the indices, and how to vectorise as much as possible. For your problem, things get much easier if you find the indices for each repeated row:

set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))

index <- 1:nrow(x)
grouped_index <- split(index, x, drop = TRUE)
names(grouped_index) <- NULL

Then you can use Map() to combine the indices to sample from and the number of samples to take for each group. I write a wrapper around sample() to protect against the annoying behaviour when x is of length 1.

sample2 <- function(x, n, ...) {
  if (length(x) == 1) return(rep(x, n))
  sample(x, n, ...)
}
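To see why the wrapper is needed: when sample()'s first argument is a single number, R treats it as sample.int() and draws from 1:x rather than from the vector itself. A minimal demonstration (sample2 is redefined here so the snippet stands alone):

```r
sample2 <- function(x, n, ...) {
  if (length(x) == 1) return(rep(x, n))
  sample(x, n, ...)
}

sample(5, 2)    # draws 2 values from 1:5 -- not two copies of 5
sample2(5, 2)   # 5 5 -- the single index, repeated n times
```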

samples <- rpois(n = length(grouped_index), lambda = 0.9)
sel <- unlist(Map(sample2, grouped_index, samples, replace = TRUE))
sel
#>  [1]  66  99  99   2   6  31  90  25  42  57  14  14   8   8  12  77  60
#> [18]  17  17  92  76  76  76  70  95  36  36  36 100  91  41  41  28  69
#> [35]  69  54  54  54  54  81  64  96  35  39  29  11  74  93  82  82  24
#> [52]  46  48  48  48  51  51  73  20  37  71  71  58  16  68  94  94  94
#> [69]  80  80  80  13  13  87  87  67  67  86  49  49  88  88  52  75  47
#> [86]  89   7  79  63  78  72  72  19

If you want to keep the original order, use sort():

sort(sel)
#>  [1]   2   6   7   8   8  11  12  13  13  14  14  16  17  17  19  20  24
#> [18]  25  28  29  31  35  36  36  36  37  39  41  41  42  46  47  48  48
#> [35]  48  49  49  51  51  52  54  54  54  54  57  58  60  63  64  66  67
#> [52]  67  68  69  69  70  71  71  72  72  73  74  75  76  76  76  77  78
#> [69]  79  80  80  80  81  82  82  86  87  87  88  88  89  90  91  92  93
#> [86]  94  94  94  95  96  99  99 100

I think the bottleneck in this code will be split(): base R doesn't have an efficient way of hashing data frames, so relies on pasting the columns together.
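For what it's worth, one hypothetical way to sidestep part of the data-frame split() overhead is to build that pasted key once yourself and split on the resulting character vector. The grouping is equivalent, though groups come back in sorted key order rather than first-appearance order:

```r
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))

# Build the row key explicitly; "\r" is an arbitrary separator that
# cannot collide with the 0/1 values in the columns.
key <- do.call(paste, c(x, sep = "\r"))
grouped_index <- split(seq_len(nrow(x)), key)
names(grouped_index) <- NULL
```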

Upvotes: 2

Andrie

Reputation: 179558

You can use rep() to create an index vector, followed by subsetting your data using this index vector.

Try this:

idx <- rep(seq_along(s), times = s)

Here are the first few values of idx; note how row 2 is repeated twice, while row 4 is absent:

idx
 [1]  1  2  2  3  6  7  8 10 11 13 14 14 ......

Then do the subsetting. Notice how the new duplicates have row names that indicate the replication.

x_unique[idx, ]

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1     1  1  0  0  0  1  0  0  1   0
2     1  0  1  0  0  1  0  0  0   0
2.1   1  0  1  0  0  1  0  0  0   0
3     1  1  0  0  1  0  0  0  1   0
6     0  0  0  0  1  1  0  0  0   0
7     0  1  1  0  1  1  0  1  1   1
8     1  1  0  1  0  0  1  1  0   0
10    0  0  1  0  1  1  1  1  0   0
....
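If the "2.1"-style replicate suffixes are unwanted, resetting the row names after subsetting gives a clean sequential index. A small follow-up sketch, rebuilding the objects from the question so it runs standalone:

```r
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_unique <- x[!duplicated(x), ]
s <- rpois(n = nrow(x_unique), lambda = 0.9)

idx <- rep(seq_along(s), times = s)
x_sampled <- x_unique[idx, ]
rownames(x_sampled) <- NULL  # replaces "2.1"-style names with 1, 2, 3, ...
```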

Upvotes: 1
