Tobias Hotzenplotz
Tobias Hotzenplotz

Reputation: 1217

Lazy random number generation without replacement in R

I want to generate random values from vector 1:n without replacement, just as sample(n) would do. However, instead of saving the permutation in memory, I want to generate the values on demand. similar to a generator in Python.

I imagine something like this:

# not working
rng <- random_permutation(n)  # 'on demand' random number generator
x <- next(rng)  # lazy creation of new random value (w/o replacement)

Why do I need this? Because n can be very large, and often only few random values will be needed. Storing the entire 1:n vector in memory would be very inefficient and not very elegant.

Upvotes: 0

Views: 434

Answers (2)

ekoam
ekoam

Reputation: 8844

Perhaps just implement it yourself? This post illustrates an efficient algorithm (Durstenfeld-Fisher-Yates) for sampling without replacement in a quite understandable way. It seems not-so-hard to implement that in R. Consider this function:

set_lazy_sample <- function(n) {
  npos <- as.integer(n)
  cache <- new.env()
  search_cache <- function(key) {
    out <- cache[[as.character(key)]]
    if (is.null(out)) key else out
  }
  function(size = 1L) {
    out <- rep.int(NA_integer_, size)
    for (i in seq_len(size)) {
      if (npos < 1L) {
        warning("Reached sampling limit. Please reset.", call. = FALSE)
        break
      }
      sel <- sample.int(npos, 1L)
      out[[i]] <- search_cache(sel)
      if (sel != npos) {
        cache[[as.character(sel)]] <- search_cache(npos)
      }
      npos <<- npos - 1L
    }
    out
  }
}

The function works like this:

> set.seed(34)
> f <- set_lazy_sample(10)
> f()
[1] 1
> f(4)
[1] 9 2 8 6
> f(6)
[1]  4 10  3  7  5 NA
Warning message:
Reached sampling limit. Please reset. 
> f()
[1] NA
Warning message:
Reached sampling limit. Please reset. 

Tested the function with the following specifications:

set.seed(4627)
# draw 4 out of 20 integers without replacement; repeat 100,000 times
simu <- vapply(1:100000, function(x) set_lazy_sample(20L)(4L), integer(4L))

As far as I can tell, the results are evenly distributed.

hist(simu, breaks = 0:20)

enter image description here

Upvotes: 2

user2554330
user2554330

Reputation: 44788

Here's a simple implementation of Fisher-Yates that takes advantage of the fact that at first the unsampled values form long sequences, so can be compactly encoded. It stores the differences using run-length encoding, only expanding during sampling. Some ideas for efficiency improvements follow:

onDemand <- function(n) {
  # Store the remainder of the deck as differences, starting from
  # zero, i.e. initially encode deck <- c(0,1,2, ..., n) as
  # rle(diff(deck))
  # To do a sample, choose an element after the 0 at random,
  # swap it with the last entry, and return it.
  remaining <- structure(list(lengths = n, values = 1),
                         class = "rle")
  encode <- function(seq) {
    rle(diff(c(0, seq)))
  }
  decode <- function(enc) {
    cumsum(inverse.rle(enc))
  }
  function(m = 1) {
    result <- numeric(m)
    remaining <- decode(remaining)
    nleft <- length(remaining)
    for (i in seq_len(m)) {
      if (!nleft)
        result[i] <- NA
      else {
        swap <- sample(nleft, 1)
        result[i] <- remaining[swap]
        remaining[swap] <- remaining[nleft]
        nleft <- nleft - 1
      }
    }
    length(remaining) <- nleft
    remaining <<- encode(remaining)
    result
  }
}

Some notes:

If n is huge (e.g. a million), the RLE will be pretty small for the first few hundred or thousand samples, but the expanded vector will be big. This could be avoided by writing methods to index directly into the encoded vector without expanding it. It's fairly easy to write methods to extract values, but replacing them is messy, so I didn't bother.

After a lot of samples have been taken, it would probably be more efficient just to store the remaining values without encoding them.

Here is a typical run:

> nextval <- onDemand(1000000)
> nextval(100)
  [1] 370610 973737 503494 916342 932407 222542 152900 783549
  [9] 249178 138066 626285 611692 805466 406702 630680  11850
 [17]  29150  19859 516327 513589 900781 923846 620311 886004
 [25] 293838 362498 451400  61116 272106 990026  78768 501649
 [33] 442166 867620 533579 679138 350663 840548 820948 586161
 [41]   5540 399160 583113 298526 382205 920895  25499 450975
 [49]  17561  18395 679743 719144  25850 421673 974477 495473
 [57] 681210 773482 175615  71834 163729 441219 992938 722924
 [65] 374084 769210 759145 923529  11192 752293 953230  96349
 [73] 988377 672156 658830 394943 715904 762062 403089 848479
 [81] 962312 303000 680417 760521 515682 237871 823706 119516
 [89] 978289 985208 437114 620376 940255 399345 221688  59345
 [97]  29765 400142 142375 911747
> environment(nextval)$remaining
Run Length Encoding
  lengths: int [1:301] 5539 1 1 5650 1 1 656 1 1 5709 ...
  values : num [1:301] 1 994421 -994419 1 988741 -988739 1 988136 -988134 1 ...

Upvotes: 1

Related Questions