Reputation: 1217
I want to generate random values from vector 1:n
without replacement, just as sample(n)
would do. However, instead of saving the permutation in memory, I want to generate the values on demand. similar to a generator in Python.
I imagine something like this:
# not working
rng <- random_permutation(n) # 'on demand' random number generator
x <- next(rng) # lazy creation of new random value (w/o replacement)
Why do I need this? Because n
can be very large, and often only few random values will be needed. Storing the entire 1:n
vector in memory would be very inefficient and not very elegant.
Upvotes: 0
Views: 434
Reputation: 8844
Perhaps just implement it yourself? This post illustrates an efficient algorithm (Durstenfeld-Fisher-Yates) for sampling without replacement in a quite understandable way. It seems not-so-hard to implement that in R. Consider this function:
set_lazy_sample <- function(n) {
npos <- as.integer(n)
cache <- new.env()
search_cache <- function(key) {
out <- cache[[as.character(key)]]
if (is.null(out)) key else out
}
function(size = 1L) {
out <- rep.int(NA_integer_, size)
for (i in seq_len(size)) {
if (npos < 1L) {
warning("Reached sampling limit. Please reset.", call. = FALSE)
break
}
sel <- sample.int(npos, 1L)
out[[i]] <- search_cache(sel)
if (sel != npos) {
cache[[as.character(sel)]] <- search_cache(npos)
}
npos <<- npos - 1L
}
out
}
}
The function works like this:
> set.seed(34)
> f <- set_lazy_sample(10)
> f()
[1] 1
> f(4)
[1] 9 2 8 6
> f(6)
[1] 4 10 3 7 5 NA
Warning message:
Reached sampling limit. Please reset.
> f()
[1] NA
Warning message:
Reached sampling limit. Please reset.
Tested the function with the following specifications:
set.seed(4627)
# draw 4 out of 20 integers without replacement; repeat 100,000 times
simu <- vapply(1:100000, function(x) set_lazy_sample(20L)(4L), integer(4L))
As far as I can tell, the results are evenly distributed.
hist(simu, breaks = 0:20)
Upvotes: 2
Reputation: 44788
Here's a simple implementation of Fisher-Yates that takes advantage of the fact that at first the unsampled values form long sequences, so can be compactly encoded. It stores the differences using run-length encoding, only expanding during sampling. Some ideas for efficiency improvements follow:
onDemand <- function(n) {
# Store the remainder of the deck as differences, starting from
# zero, i.e. initially encode deck <- c(0,1,2, ..., n) as
# rle(diff(deck))
# To do a sample, choose an element after the 0 at random,
# swap it with the last entry, and return it.
remaining <- structure(list(lengths = n, values = 1),
class = "rle")
encode <- function(seq) {
rle(diff(c(0, seq)))
}
decode <- function(enc) {
cumsum(inverse.rle(enc))
}
function(m = 1) {
result <- numeric(m)
remaining <- decode(remaining)
nleft <- length(remaining)
for (i in seq_len(m)) {
if (!nleft)
result[i] <- NA
else {
swap <- sample(nleft, 1)
result[i] <- remaining[swap]
remaining[swap] <- remaining[nleft]
nleft <- nleft - 1
}
}
length(remaining) <- nleft
remaining <<- encode(remaining)
result
}
}
Some notes:
If n
is huge (e.g. a million), the RLE will be pretty small for the first few hundred or thousand samples, but the expanded vector will be big. This could be avoided by writing methods to index directly into the encoded vector without expanding it. It's fairly easy to write methods to extract values, but replacing them is messy, so I didn't bother.
After a lot of samples have been taken, it would probably be more efficient just to store the remaining values without encoding them.
Here is a typical run:
> nextval <- onDemand(1000000)
> nextval(100)
[1] 370610 973737 503494 916342 932407 222542 152900 783549
[9] 249178 138066 626285 611692 805466 406702 630680 11850
[17] 29150 19859 516327 513589 900781 923846 620311 886004
[25] 293838 362498 451400 61116 272106 990026 78768 501649
[33] 442166 867620 533579 679138 350663 840548 820948 586161
[41] 5540 399160 583113 298526 382205 920895 25499 450975
[49] 17561 18395 679743 719144 25850 421673 974477 495473
[57] 681210 773482 175615 71834 163729 441219 992938 722924
[65] 374084 769210 759145 923529 11192 752293 953230 96349
[73] 988377 672156 658830 394943 715904 762062 403089 848479
[81] 962312 303000 680417 760521 515682 237871 823706 119516
[89] 978289 985208 437114 620376 940255 399345 221688 59345
[97] 29765 400142 142375 911747
> environment(nextval)$remaining
Run Length Encoding
lengths: int [1:301] 5539 1 1 5650 1 1 656 1 1 5709 ...
values : num [1:301] 1 994421 -994419 1 988741 -988739 1 988136 -988134 1 ...
Upvotes: 1