statsman
statsman

Reputation: 91

Generate N distinct permutations of a very large list in R?

So I have a vector of numbers, say 1:8000. I want generate exactly n distinct permutations of this vector. What is a way to do this without having to calculate all permutations (since 8000! is huge and can't fit in RAM).

function distinct_permutations(L, N){
  # Return N distinct permutations of L as a list of lists
  return(x)
}

x <- seq(1:8000)

Upvotes: 4

Views: 189

Answers (3)

jblood94
jblood94

Reputation: 16981

For large vectors @polkas is correct that lapply with sample will be more performant.

To add more robustness to the solution already provided:

library(RcppAlgos)

perm_n.safe <- function(v, n) {
  if (n/permuteCount(v) > 0.01) return(as.list(data.frame(t(permuteSample(v, n = n)))))
  k <- 0L
  out <- vector("list", n)
  m <- n
  
  while (k < m) {
    s <- unique(lapply(1:n, function(i) sample(v)))
    out[(k + 1L):(k + length(s))] <- s
    k <- k + length(s)
    n <- ceiling(1.1*n/k)
  }
  
  out
}

This avoids a blind doubling of n and is guaranteed to return n samples.

set.seed(148461194)

microbenchmark::microbenchmark(
  perm_n = perm_n(1:8000, 300),
  perm_n.safe = perm_n.safe(1:8000, 300),
  times = 10
)
#> Unit: milliseconds
#>         expr      min       lq     mean   median       uq      max neval
#>       perm_n 229.9223 231.2907 233.5303 233.0133 235.8125 237.5975    10
#>  perm_n.safe 117.1336 118.6889 123.5767 119.3938 122.4304 147.8789    10

microbenchmark::microbenchmark(
  perm_n = perm_n(1:5, 100),
  perm_n.safe = perm_n.safe(1:5, 100)
)
#> Unit: microseconds
#>         expr   min     lq    mean median    uq    max neval
#>       perm_n 728.8 766.75 908.462 789.85 993.4 3327.8   100
#>  perm_n.safe 168.9 186.90 211.721 200.80 228.4  414.1   100

sum(lengths(perm_n(1:5, 100)) == 5)
#> [1] 92
sum(lengths(perm_n.safe(1:5, 100)) == 5)
#> [1] 100

Upvotes: 1

polkas
polkas

Reputation: 4184

I will implement my first idea from 3 proposed in comments and compared with the already provided by other User. My solution is more universal as any set of values could be provided (other ones have to be treated as indexes), I provide a list of vectors and my solution is faster for many scenarios.

Current solutions:

  • perm_n my own solution powered by sample and unique
  • RcppAlgos::permuteSample
  • arrangements::permutations

EDIT: added additional method

perm_n <- function(vec, n) {
  # sample 2 * N samples as a risk of duplicates for bigger samples
  unique(
    lapply(seq_len(n * 2), function(x) sample(vec, length(vec)))
  )[seq_len(n)]
}

microbenchmark::microbenchmark(
  RcppAlgos::permuteSample(v=10, m=10, n=10),
  perm_n(1:10, 10),
  arrangements::permutations(10, 10, nsample = 10)
)
#> Warning in microbenchmark::microbenchmark(RcppAlgos::permuteSample(v = 10, :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#>                                              expr     min       lq      mean
#>  RcppAlgos::permuteSample(v = 10, m = 10, n = 10) 230.174 249.7105 1605.4817
#>                                  perm_n(1:10, 10)  65.559  68.0190  182.8387
#>  arrangements::permutations(10, 10, nsample = 10)   3.649   4.1205  238.7746
#>    median       uq      max neval
#>  1038.961 2292.556 28600.70   100
#>    71.217   82.861 10617.61   100
#>     4.633    7.298 23295.99   100

microbenchmark::microbenchmark(
  RcppAlgos::permuteSample(v=8000, m=8000, n=10),
  perm_n(1:8000, 10),
  arrangements::permutations(8000, 8000, nsample = 10),
  times = 10
)
#> Unit: milliseconds
#>                                                  expr      min         lq
#>  RcppAlgos::permuteSample(v = 8000, m = 8000, n = 10) 199.5939 199.858108
#>                                    perm_n(1:8000, 10)   4.3911   4.428697
#>  arrangements::permutations(8000, 8000, nsample = 10) 593.0577 594.338993
#>       mean     median         uq        max neval
#>  200.44507 200.282540 200.944936 201.960178    10
#>    4.49205   4.457807   4.555182   4.663914    10
#>  600.67245 598.277678 604.146808 619.717132    10

microbenchmark::microbenchmark(
  RcppAlgos::permuteSample(v=8000, m=8000, n=300),
  perm_n(1:8000, 300),
  arrangements::permutations(8000, 8000, nsample = 300),
  times = 3
)
#> Unit: milliseconds
#>                                                   expr        min         lq
#>  RcppAlgos::permuteSample(v = 8000, m = 8000, n = 300)  5939.2831  5965.0268
#>                                    perm_n(1:8000, 300)   136.7174   137.3693
#>  arrangements::permutations(8000, 8000, nsample = 300) 17875.7788 17877.2049
#>       mean     median         uq        max neval
#>   5984.365  5990.7704  6006.9057  6023.0410     3
#>    138.048   138.0212   138.7133   139.4053     3
#>  17891.089 17878.6310 17898.7438 17918.8566     3

Created on 2022-11-23 with reprex v2.0.2

Upvotes: 2

jay.sf
jay.sf

Reputation: 72758

Using RcppAlgos::permuteSample, here demonstrated with n=3 samples of 10 out of 10.

set.seed(42)
RcppAlgos::permuteSample(v=10, m=10, n=3)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    9    8    5    3    4    1   10    6    2     7
# [2,]    1    3    5    9    7    6    8   10    2     4
# [3,]    2    8    5    9    4    7    3   10    6     1

Upvotes: 3

Related Questions