Mohan
Mohan

Reputation: 8863

Is there a faster way to write code to draw many random samples from 1:N?

I'm drawing many sorted samples of size k (typically ~5) from 1:35. The code I am using is:

replicate(1000000, sort(sample(1:35, size = 5)) )

But this is rather slow. I assume that since I am using replicate, no parallelisation is happening. Is there a faster way to write this code?

Upvotes: 4

Views: 105

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 102710

We can still stick to base R and make it faster if we apply order outside replicate, rather than sort inside, for example

set.seed(0)
N <- 1000000
mohan <- \(N) replicate(N, sort(sample(1:35, size = 5)))

tic <- \(N) {
  v <- replicate(N, sample(1:35, size = 5))
  `dim<-`(v[order(col(v), v)], dim(v))
}

microbenchmark(
  mohan(N),
  tic(N),
  times = 10L,
  unit = "relative"
)

and you can see that

Unit: relative
     expr      min       lq     mean   median       uq      max neval
 mohan(N) 5.334456 5.036718 4.921129 4.884663 4.647009 4.590792    10
   tic(N) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10

and

> system.time(mohan(N))
   user  system elapsed 
  33.66    0.44   34.25

> system.time(tic(N))
   user  system elapsed 
   5.95    0.19    6.13

Upvotes: 2

jblood94
jblood94

Reputation: 17001

Vectorized approaches will be faster than both replicate and going parallel. How much faster depends on whether we are sampling with or without replacement. My benchmarks give a ~125x speedup over replicate for sampling with replacement, and a ~12x speedup for sampling without replacement. Going parallel provides a ~3x speedup.

library(Rfast)
library(collapse)

With replacement:

frexp <- function(m, R, x) {
  cs <- colCumSums(matrix(rexp(m*R), m, R))
  `dim<-`(x[ceiling(setop(cs*length(x), "/", cs[m,] + rexp(R),
                          rowwise = TRUE))], c(m, R))
}

Without replacement:

fshuffle <- function(m, R, x) {
  colSort(colShuffle(matrix(x, length(x), R))[1:m,])
}

Demonstrating:

frexp(5, 4, 1:35)
#>      [,1] [,2] [,3] [,4]
#> [1,]   19    6    4   12
#> [2,]   28   10    6   32
#> [3,]   28   15   14   33
#> [4,]   33   34   21   33
#> [5,]   34   34   30   35
fshuffle(5, 4, 1:35)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6    9    7
#> [2,]   12   12   13    9
#> [3,]   15   14   16   19
#> [4,]   22   19   25   21
#> [5,]   31   26   32   27

Check that frexp is sampling 1:35 uniformly:

tabulate(frexp(5, 1e6, 1:35))
#>  [1] 142924 142890 142287 142463 143604 142237 142681 143141 142824 143109
#> [11] 143370 142697 142561 142554 142922 143104 142446 143135 142595 143220
#> [21] 142642 143360 142728 143148 143230 143062 143508 143177 142520 143071
#> [31] 142198 142591 142396 142840 142765

Benchmark against replicate and future.apply::future_replicate:

library(future.apply)
plan(multisession, workers = 15L)

bench::mark(
  frexp = frexp(5, 1e6, 1:35),
  fshuffle = fshuffle(5, 1e6, 1:35),
  future_replicate = future_replicate(1e6, sample(1:35, 5L, replace = TRUE)),
  replicate = replicate(1e6, sort(sample(1:35, size = 5))),
  check = FALSE
)
#> # A tibble: 4 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 frexp            281.67ms 288.29ms    3.47    209.81MB    1.73 
#> 2 fshuffle            2.86s    2.86s    0.350     1.04GB    1.05 
#> 3 future_replicate   10.26s   10.26s    0.0974  259.05MB    0.585
#> 4 replicate          35.36s   35.36s    0.0283  267.79MB    3.76

Upvotes: 2

Friede
Friede

Reputation: 8087

TTBOMK, replicate is a wrapper for sapply, which itself is a wrapper for lapply. There are plenty of options; simple might be:

 library(future.apply)
#> Loading required package: future
 plan(multisession, workers = 6L)
 i = seq(35L)
 system.time({ X = future_replicate(1e6, sample(i, 5L, replace = TRUE)) |> t() })
#>    user  system elapsed 
#>   3.122   0.141   4.619
 head(X)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   26   18   31    9   35
#> [2,]   25   10   10   24   16
#> [3,]    9    9   35    4   25
#> [4,]   27   32   26   15   27
#> [5,]   13    2    6   26   24
#> [6,]   26   23   12    7   26
 library(Rfast)
 rowSort(head(X))
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    9   18   26   31   35
#> [2,]   10   10   16   24   25
#> [3,]    4    9    9   25   35
#> [4,]   15   26   27   27   32
#> [5,]    2    6   13   24   26
#> [6,]    7   12   23   26   26

Upvotes: 3

Related Questions