Reputation: 8863
I'm drawing many sorted samples of size k (typically ~5) from 1:35
. The code I am using is:
replicate(1000000, sort(sample(1:35, size = 5)) )
But this is rather slow. I assume that since I am using replicate
, no parallelisation is happening. Is there a faster way to write this code?
Upvotes: 4
Views: 105
Reputation: 102710
We can still stick to base R and make it faster if we apply order
outside replicate
, rather than sort
inside, for example
set.seed(0)
N <- 1000000
mohan <- \(N) replicate(N, sort(sample(1:35, size = 5)))
tic <- \(N) {
v <- replicate(N, sample(1:35, size = 5))
`dim<-`(v[order(col(v), v)], dim(v))
}
microbenchmark(
mohan(N),
tic(N),
times = 10L,
unit = "relative"
)
and you can see that
Unit: relative
expr min lq mean median uq max neval
mohan(N) 5.334456 5.036718 4.921129 4.884663 4.647009 4.590792 10
tic(N) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
and
> system.time(mohan(N))
user system elapsed
33.66 0.44 34.25
> system.time(tic(N))
user system elapsed
5.95 0.19 6.13
Upvotes: 2
Reputation: 17001
Vectorized approaches will be faster than both replicate
and going parallel. How much faster depends on whether we are sampling with or without replacement. My benchmarks give a ~125x speedup over replicate
for sampling with replacement, and a ~12x speedup for sampling without replacement. Going parallel provides a ~3x speedup.
library(Rfast)
library(collapse)
With replacement:
frexp <- function(m, R, x) {
cs <- colCumSums(matrix(rexp(m*R), m, R))
`dim<-`(x[ceiling(setop(cs*length(x), "/", cs[m,] + rexp(R),
rowwise = TRUE))], c(m, R))
}
Without replacement:
fshuffle <- function(m, R, x) {
colSort(colShuffle(matrix(x, length(x), R))[1:m,])
}
Demonstrating:
frexp(5, 4, 1:35)
#> [,1] [,2] [,3] [,4]
#> [1,] 19 6 4 12
#> [2,] 28 10 6 32
#> [3,] 28 15 14 33
#> [4,] 33 34 21 33
#> [5,] 34 34 30 35
fshuffle(5, 4, 1:35)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 6 9 7
#> [2,] 12 12 13 9
#> [3,] 15 14 16 19
#> [4,] 22 19 25 21
#> [5,] 31 26 32 27
Check that frexp
is sampling 1:35
uniformly:
tabulate(frexp(5, 1e6, 1:35))
#> [1] 142924 142890 142287 142463 143604 142237 142681 143141 142824 143109
#> [11] 143370 142697 142561 142554 142922 143104 142446 143135 142595 143220
#> [21] 142642 143360 142728 143148 143230 143062 143508 143177 142520 143071
#> [31] 142198 142591 142396 142840 142765
Benchmark against replicate
and future.apply::future_replicate
:
library(future.apply)
plan(multisession, workers = 15L)
bench::mark(
frexp = frexp(5, 1e6, 1:35),
fshuffle = fshuffle(5, 1e6, 1:35),
future_replicate = future_replicate(1e6, sample(1:35, 5L, replace = TRUE)),
replicate = replicate(1e6, sort(sample(1:35, size = 5))),
check = FALSE
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 frexp 281.67ms 288.29ms 3.47 209.81MB 1.73
#> 2 fshuffle 2.86s 2.86s 0.350 1.04GB 1.05
#> 3 future_replicate 10.26s 10.26s 0.0974 259.05MB 0.585
#> 4 replicate 35.36s 35.36s 0.0283 267.79MB 3.76
Upvotes: 2
Reputation: 8087
TTBOMK, replicate
is a wrapper for sapply
, which itself is a wrapper for lapply
. There are plenty of options; simple might be:
library(future.apply)
#> Loading required package: future
plan(multisession, workers = 6L)
i = seq(35L)
system.time({ X = future_replicate(1e6, sample(i, 5L, replace = TRUE)) |> t() })
#> user system elapsed
#> 3.122 0.141 4.619
head(X)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 26 18 31 9 35
#> [2,] 25 10 10 24 16
#> [3,] 9 9 35 4 25
#> [4,] 27 32 26 15 27
#> [5,] 13 2 6 26 24
#> [6,] 26 23 12 7 26
library(Rfast)
rowSort(head(X))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 9 18 26 31 35
#> [2,] 10 10 16 24 25
#> [3,] 4 9 9 25 35
#> [4,] 15 26 27 27 32
#> [5,] 2 6 13 24 26
#> [6,] 7 12 23 26 26
Upvotes: 3