Reputation: 14192
Say I have a even-length vector such as this:
v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
It is 14 elements long. I wish to randomly sample 7 pairs of elements without replacement, but a rule is that no pair should contain two of the same item.
So the following result would be acceptable:
1-2, 1-2, 1-2, 1-3, 3-4, 3-5, 6-7
I am not sure how to do this systematically. Clearly brute force would work, e.g.
set.seed(1)
v=c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
length(v)
v1<-sample(v)
pairs <- split(v1, ceiling(seq_along(v1)/2))
sapply(pairs, diff)
1 2 3 4 5 6 7
1 1 2 3 -6 -3 3
This shows that no pair has duplicate elements as the difference is always not 0. In my case, I need to do this 1000s of times and it's not so easy to avoid duplicates. Is there a more effective way?
Upvotes: 2
Views: 611
Reputation: 51998
Here is a variation of your "brute-force" approach (better known as "hit-or-miss"):
rand.pairs <- function(v, time.out = 1000){
n <- length(v)
for(i in 1:time.out){
v <- sample(v)
first <- v[1:(n/2)]
second <- v[(n/2+1):n]
if(all(first != second)) return(unname(rbind(first,second)))
}
NULL
}
The point of time.out
is to avoid infinite loops. For some input vectors a solution might be either impossible or too hard to hit upon by chance.
Example run:
> v <- c(1,1,1,1,2,2,2,3,3,3,4,5,6,7)
> set.seed(1234)
> rand.pairs(v)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 6 3 3 7 2 2 5
[2,] 1 4 1 1 3 1 2
It is fast enough to run thousands of times:
> library(microbenchmark)
> microbenchmark(rand.pairs(v))
Unit: microseconds
expr min lq mean median uq max neval
rand.pairs(v) 6.7 7.758 16.17517 12.166 19.747 70.877 100
Your mileage may vary, but if your machine is at all comparable, you should be able to call this function over 50,000 times per second. replicate(10000,rand.pairs(v))
takes much less than a second to run. On the other hand, if you have an input for which the constraints are harder to satisfy, a solution might require more time.
Upvotes: 0
Reputation: 160417
v0 <- table(v)
set.seed(2)
out <- replicate(7, sample(names(v0), size=2, prob=v0))
out
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "1" "2" "4" "1" "3" "2" "6"
# [2,] "5" "1" "7" "7" "2" "1" "1"
I use table(v)
and names(v0)
so that I'm guaranteed the names and the probs are in the same order. (I didn't want to assume that your actual data is structured identically.) If you need integers, then it's easy enough to us as.integer
.
If you literally need 1-2
, then
apply(out, 2, paste, collapse="-")
# [1] "1-5" "2-1" "4-7" "1-7" "3-2" "2-1" "6-1"
I'm confident that this will produce no dupes (because names(v0)
is unique and the default replace=FALSE
), but here's an empirical test:
set.seed(3)
l <- replicate(1e5, sample(unique(v), size=2, prob=table(v)))
any(l[1,] == l[2,])
# [1] FALSE
Upvotes: 3