user248237
user248237

Reputation:

getting the most 'diverse' set of pairs from a list in Python?

I have a list in python of length N and I want to choose K pairs of elements from it, where repetition of elements within a pair is not allowed and where (x,y) == (y,x) (order insensitive). There are N choose 2 pairs possible, and K is always significantly smaller than N. What is a good deterministic (no sampling) way to pick the most 'diverse' and representative set of pairs from the list, meaning: (1) set of pairs where highest number of items from the list is represented (and there's not a bias for any particular element), (2) and where the list of pairs is not biased towards the start or end of the list?

example:

l = [1,2,3,4,5]

there are 5 choose 2 = 10 combinations possible. If we ask for 2 pairs (K = 2), a good set of pairs would be [(1,2),(3,4)] because almost every element appears in the list and we don't have repetitions of any element. A bad set of pairs for K = 2 would be: [(1,2),(1,3)] since it's reusing the 1 element and is clearly biased towards the start of list. If K were > 2 in this case we'd need to repeat elements, it's inevitable, but I want to find a way to do this that is representative/diverse wrt list.

I'm just looking for an efficient and simple heuristic, does not have have to be perfect. any ideas?

happy to use numpy/scipy for this.

Upvotes: 0

Views: 1767

Answers (2)

user2566092
user2566092

Reputation: 4661

You need at least pseudo-random sampling of some kind or else there will always be some sort of "bias" when you re-run your pair sampling code, whether toward the beginning or end, or somewhere else. If K is smaller than N/2, and if N is not too large (say 100 million or less) then you can use the following python code, which avoids repetitive sampling calls because it generates K psuedo-random pairs all at once, avoiding duplicates

import random

X = range(N)

random.seed() # uses system time to initialize random number generator, or you can pass in a deterministic seed as an argument if you want

# code to use to generate K pairs
A = random.sample(X,2*K) # now you have a list of 2*K unique elements from 0 to N-1
pairs = zip(A[0:K],A[K:(2*K)]) # now you have your pairs

Now, if K is greater than N/2 then you will have to have duplicates, but you can minimize duplicates similar to above, by simply rerunning code similar to the 2 lines above in a loop. If N is odd this creates headaches but a simple very close approximate strategy is to repeatedly generate floor(N/2) pairs (avoiding duplicates) and just leave one number unused each time. Code follows:

pairs = []
M = N
if M % 2 == 1:
  M -= 1
while len(pairs) < K:
  B = random.sample(X,M)
  A = zip(B[0:(M/2)],B[(M/2):M])
  pairs.extend(A)
pairs = pairs[0:K]

Upvotes: 1

David Eisenstat
David Eisenstat

Reputation: 65458

Take the first K matches from a round-robin tournament? Permute the list by a pseudorandom Fisher–Yates shuffle with a deterministic seed to avoid bias toward the ends.

Upvotes: 0

Related Questions