chris_90

Reputation: 35

R keep randomly generating numbers until all numbers within specified range are present

My aim is to use R to randomly generate a vector of integers populated with numbers between 1 and 8. However, I want to keep growing the vector until all the numbers in 1:8 are represented at least once, e.g. 1,4,6,2,2,3,5,1,4,7,6,8.

I am able to generate single numbers or a sequence of numbers using sample

x = sample(1:8, 1, replace = T)
x
# [1] 6
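The same call also gives a sequence if I ask for more than one number (a made-up illustration; actual output will vary):

x = sample(1:8, 5, replace = T)
x
# e.g. [1] 2 7 7 1 5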

I have played around with a repeat loop to see how it might work with sample, and I can at least get the generation to stop when one specific number occurs, e.g.

repeat {
   print(x)                          # prints the previous draw, not the new one
   x = sample(1:8, 1, replace = T)   # draw the next number
   if (x == 3){                      # stop once a 3 is drawn
       break
   }
}

Which gives:

[1] 3
[1] 6
[1] 6
[1] 6
[1] 6
[1] 6
[1] 2

I am struggling now to work out how to stop number generation once all the numbers in 1:8 are present. Additionally, I know that the above code only prints the sequence as it is generated and does not store it as a vector. Any advice pointing me in the right direction would be really appreciated!

Upvotes: 1

Views: 956

Answers (2)

d.b

Reputation: 32548

This is fine for 1:8, but growing the vector one element at a time might not always be a good idea.

foo = integer(0)
set.seed(42)
while(TRUE){
    foo = c(foo, sample(1:8, 1))    # append one draw at a time
    if(all(1:8 %in% foo)) break     # stop once every value has appeared
}
foo
# [1] 8 8 3 7 6 5 6 2 6 6 4 6 8 3 4 8 8 1

If you have more than 1:8, it may be better to obtain the average number of tries (N) required to get all the numbers at least once, and then sample N numbers such that all numbers are sampled at least once. This is the coupon collector problem: the expected number of draws to see all n values is n * (1/1 + 1/2 + ... + 1/n), which is what the ceiling(sum(...)) line below computes.

set.seed(42)
vec = 1:8
N = ceiling(sum(length(vec)/(1:length(vec))))
foo = sample(c(vec, sample(vec, N - length(vec), TRUE)))
foo
# [1] 3 6 8 3 8 8 6 4 5 6 1 6 4 6 6 3 5 7 2 2 7 8
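As a sanity check on that estimate of N, a quick simulation of how many draws it actually takes on average (draws_needed is an illustrative helper, not part of the original answer):

draws_needed <- function(n) {
  seen <- logical(n)                # which values have appeared so far
  count <- 0L
  while (!all(seen)) {
    count <- count + 1L
    seen[sample(n, 1)] <- TRUE
  }
  count
}
set.seed(42)
mean(replicate(10000, draws_needed(8)))
# should land near sum(8 / (1:8)) ~ 21.74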

Upvotes: 2

r2evans

Reputation: 160417

Taking a cue from d.b, here's a slightly more verbose method that is more memory-efficient (and a little faster too, though I doubt speed is your issue):

Differences:

  • pre-allocate memory in chunks (size 100 here); this mitigates the problem with extend-by-one vector work, since allocating and extending 100 (or even 1000) elements at a time is much cheaper
  • compare only the newest number instead of all numbers each time (the first n-1 numbers have already been tabulated, so there is no need to do that again)

Code:

library(microbenchmark)
n <- 8

microbenchmark(
  r2evans = {
    emptyvec100 <- integer(100)
    counter <- 0
    out <- integer(0)
    unseen <- seq_len(n)
    set.seed(42)
    repeat {
      # extend the result in chunks of 100 rather than one element at a time
      if (counter %% 100 == 0) out <- c(out, emptyvec100)
      counter <- counter+1
      num <- sample(n, size=1)
      # check only the newest draw against the values not yet seen
      unseen <- unseen[unseen != num]
      out[counter] <- num
      if (!length(unseen)) break
    }
    out <- out[1:counter]
  },
  d.b = {
    foo = integer(0)
    set.seed(42)
    while(TRUE){
      foo = c(foo, sample(1:n, 1))
      if(all(1:n %in% foo)) break
    }
  }, times = 1000, unit = 'us')
# Unit: microseconds
#     expr      min       lq     mean   median       uq      max neval
#  r2evans 1090.007 1184.639 1411.531 1228.947 1320.845 11344.24  1000
#      d.b 1242.440 1372.264 1835.974 1441.916 1597.267 14592.74  1000

(This is intended neither as code-golf nor speed-optimization. My primary goal is to argue against extend-by-one vector work, and suggest a more efficient comparison technique.)
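To isolate the extend-by-one cost, here is a minimal sketch (illustrative only, not part of the original answer) comparing the two growth strategies on their own:

library(microbenchmark)
# growing a vector one element at a time versus filling a
# pre-allocated vector of the same final length
microbenchmark(
  grow_by_one = {
    x <- integer(0)
    for (i in 1:1000) x <- c(x, i)
  },
  preallocated = {
    x <- integer(1000)
    for (i in 1:1000) x[i] <- i
  },
  times = 100
)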

As d.b further suggested, this works fine for 1:8 but may run into trouble with larger numbers. If we extend n up:

[updated benchmark plot: execution time for both approaches as n grows]
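The exact code behind that plot isn't shown; one way to reproduce a comparison like it would be to wrap each approach in a function of n and time both across growing n (draw_db and draw_r2evans are illustrative wrapper names, not from the answers):

library(microbenchmark)

draw_db <- function(n) {          # extend-by-one approach
  foo <- integer(0)
  while (TRUE) {
    foo <- c(foo, sample(1:n, 1))
    if (all(1:n %in% foo)) break
  }
  foo
}

draw_r2evans <- function(n) {     # chunked pre-allocation approach
  out <- integer(0)
  counter <- 0
  unseen <- seq_len(n)
  repeat {
    if (counter %% 100 == 0) out <- c(out, integer(100))
    counter <- counter + 1
    num <- sample(n, size = 1)
    unseen <- unseen[unseen != num]
    out[counter] <- num
    if (!length(unseen)) break
  }
  out[1:counter]
}

for (n in c(8, 32, 128)) {
  print(microbenchmark(draw_db(n), draw_r2evans(n), times = 50))
}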

(Edit: with d.b's code changes, the execution times are much closer, and not nearly as exponential looking. Apparently the removal of unique had significant benefits to his code.)

Upvotes: 1
