jalapic
jalapic

Reputation: 14192

Sampling pairs of elements from a vector but without duplication

Say I have a even-length vector such as this:

v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)

It is 14 elements long. I wish to randomly sample 7 pairs of elements without replacement, but a rule is that no pair should contain two of the same item.

So the following result would be acceptable:

1-2, 1-2, 1-2, 1-3, 3-4, 3-5, 6-7

I am not sure how to do this systematically. Clearly brute force would work, e.g.

set.seed(1)
v=c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
length(v)
v1<-sample(v)
pairs <- split(v1, ceiling(seq_along(v1)/2))
sapply(pairs, diff)

 1  2  3  4  5  6  7 
 1  1  2  3 -6 -3  3 

This shows that no pair has duplicate elements as the difference is always not 0. In my case, I need to do this 1000s of times and it's not so easy to avoid duplicates. Is there a more effective way?

Upvotes: 2

Views: 611

Answers (2)

John Coleman
John Coleman

Reputation: 51998

Here is a variation of your "brute-force" approach (better known as "hit-or-miss"):

rand.pairs <- function(v, time.out = 1000){
  n <- length(v)
  for(i in 1:time.out){
    v <- sample(v)
    first <- v[1:(n/2)]
    second <- v[(n/2+1):n]
    if(all(first != second)) return(unname(rbind(first,second)))
  }
  NULL
}

The point of time.out is to avoid infinite loops. For some input vectors a solution might be either impossible or too hard to hit upon by chance.

Example run:

> v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
> set.seed(1234)
> rand.pairs(v)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    6    3    3    7    2    2    5
[2,]    1    4    1    1    3    1    2

It is fast enough to run thousands of times:

> library(microbenchmark)
> microbenchmark(rand.pairs(v))
Unit: microseconds
          expr min    lq     mean median     uq    max neval
 rand.pairs(v) 6.7 7.758 16.17517 12.166 19.747 70.877   100

Your mileage may vary, but if your machine is at all comparable, you should be able to call this function over 50,000 times per second. replicate(10000,rand.pairs(v)) takes much less than a second to run. On the other hand, if you have an input for which the constraints are harder to satisfy, a solution might require more time.

Upvotes: 0

r2evans
r2evans

Reputation: 160417

v0 <- table(v)
set.seed(2)
out <- replicate(7, sample(names(v0), size=2, prob=v0))
out
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "1"  "2"  "4"  "1"  "3"  "2"  "6" 
# [2,] "5"  "1"  "7"  "7"  "2"  "1"  "1" 

I use table(v) and names(v0) so that I'm guaranteed the names and the probs are in the same order. (I didn't want to assume that your actual data is structured identically.) If you need integers, then it's easy enough to us as.integer.

If you literally need 1-2, then

apply(out, 2, paste, collapse="-")
# [1] "1-5" "2-1" "4-7" "1-7" "3-2" "2-1" "6-1"

I'm confident that this will produce no dupes (because names(v0) is unique and the default replace=FALSE), but here's an empirical test:

set.seed(3)
l <- replicate(1e5, sample(unique(v), size=2, prob=table(v)))
any(l[1,] == l[2,])
# [1] FALSE

Upvotes: 3

Related Questions