How to make sure that a list of generated numbers follows a uniform distribution

I have a list of 150 numbers from 0 to 149. I would like to use a for loop with 150 iterations in order to generate 150 lists of 6 numbers such that, in each iteration k, the number k is included as well as 5 different random numbers. For example:

S0 = [0, r1, r2, r3, r4, r5] # r1, r2,..., r5 are random numbers between 0 and 149
S1 = [1, r1', r2', r3', r4', r5'] # r1', r2',..., r5' are new random numbers between 0 and 149
...
S149 = [149, r1'', r2'', r3'', r4'', r5''] 

In addition, the numbers in each list have to be different from each other, with a minimum distance of 5 between any two of them. This is the code I am using:

import random
import numpy as np

final_list = []
for k in range(150):
    S = [k]
    for it in range(5):
        # Candidates are all numbers not already chosen for this list.
        domain = [ele for ele in range(150) if ele not in S]
        d = 0
        # Rejection sampling: redraw until the candidate x is at least
        # 5 away from every number already in S.
        while d < 5:
            d = np.inf
            x = random.sample(domain, 1)[0]
            for ch in S:
                if np.abs(ch - x) < d:
                    d = np.abs(ch - x)
        S.append(x)
    final_list.append(S)

Output:

[[0, 149, 32, 52, 39, 126],
 [1, 63, 16, 50, 141, 79],
 [2, 62, 21, 42, 35, 71],
...
 [147, 73, 38, 115, 82, 47],
 [148, 5, 78, 115, 140, 43],
 [149, 36, 3, 15, 99, 23]]

Now, the code is working, but I would like to know if it is possible to force the number of times each number appears across all the iterations to be approximately the same. For example, after running the previous code, this plot shows how many times each number appeared in the generated lists:

[Plot: number of occurrences of each number across the 150 generated lists]

As you can see, some numbers have appeared more than 10 times while others have appeared only 2 times. Is it possible to reduce this variation so that the plot approximates a uniform distribution? Thanks.
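For reference, the per-number counts behind such a plot can be computed with something along these lines, using collections.Counter over every entry of final_list:

import collections

counts = collections.Counter(x for S in final_list for x in S)  # occurrences of each number
print(min(counts.values()), max(counts.values()))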

Upvotes: 1

Views: 896

Answers (2)

Cireo

Reputation: 4427

You can force it to be precisely uniform, if you so desire.

Apologies for the mix of globals and locals; this seemed the most readable. You would want to rewrite it according to how variable your constants are =)

import random

SIZE = 150
SAMPLES = 5

def get_samples():
    # Five copies of every number 0..SIZE-1, shuffled, then dealt out in
    # chunks of SAMPLES, so each number is used exactly SAMPLES times overall.
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    items = []
    for i in range(SIZE):
        # Take the next SAMPLES values from the pool and prepend the index i.
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        item = [i] + selection
        items.append(item)
    return items

Then you will have exactly 5 of each (and one more in the leading position, which is a weird data structure).

>>> import collections
>>> set(collections.Counter(vv for v in get_samples() for vv in v).values())
{6}

The method above does not guarantee that the 5 sampled numbers in each list are unique; in fact, you would expect roughly 10 of the 150 lists to contain a duplicate. If that is important, you need to filter your distribution a little more and decide how much you value tight uniformity, absence of duplicates, etc.
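For instance, a quick sketch to measure that with the first version above (the exact count varies from run to run) could be:

rows = get_samples()
# Rows whose 5 sampled values (everything after the leading index) contain a repeat
dup_rows = sum(len(set(row[1:])) < SAMPLES for row in rows)
print(dup_rows, "of", SIZE, "rows contain a duplicate")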

If your numbers are approximately what you gave above, you can also patch up the results (fairly) and hope to avoid long search times (this is not the case when SAMPLES is close to SIZE):

def get_samples():
    pool = list(range(SIZE)) * SAMPLES
    random.shuffle(pool)
    i = 0
    while i < len(pool):
        if i % SAMPLES == 0:
            seen = set()  # values seen so far in the current chunk of SAMPLES
        v = pool[i]
        if v in seen:  # duplicate within this chunk: swap it somewhere else in the pool
            dst = random.randrange(len(pool))
            pool[dst], pool[i] = pool[i], pool[dst]
            # Re-check from the start of the earliest chunk touched by the swap
            i = min(i, dst) - min(i, dst) % SAMPLES
        else:
            seen.add(v)
            i += 1
    items = []
    for i in range(SIZE):
        selection, pool = pool[:SAMPLES], pool[SAMPLES:]
        assert len(set(selection)) == SAMPLES, selection
        item = [i] + selection
        items.append(item)
    return items

This will typically take fewer than 5 passes to clean up any duplicates, and should leave all duplicate-free arrangements roughly equally likely. (Note that, like the first version, this does not enforce the minimum-distance-of-5 constraint from the question.)
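As a usage sketch for the patched version, the exact-count property can be checked the same way as before:

import collections

rows = get_samples()
# Each number should still appear exactly SAMPLES times among the sampled positions;
# the assert inside get_samples already guarantees the 5 samples in a row are distinct.
counts = collections.Counter(v for row in rows for v in row[1:])
print(set(counts.values()))  # expected: {5}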

Upvotes: 0

Amitai Irron

Reputation: 2055

First, I am not sure that your assertion that the current results are not uniformly distributed is necessarily correct. It would seem prudent to me to try and examine the histogram over several repetitions of the process, rather than just one.

I am not a statistician, but when I want to approximate a uniform distribution (and assuming that the functions in random provide a uniform distribution), what I try to do is simply to accept all results returned by the random functions. For that, I need to limit the choices given to these functions before calling them. This is how I would go about your task:

import random
import numpy as np

N = 150

def random_subset(n):
    result = []
    cands = set(range(N))
    for i in range(6):
        result.append(n)                  # Initially, n is the number that must appear in the result
        cands -= set(range(n - 4, n + 5)) # Remove candidates less than 5 away 
        n = random.choice(list(cands))    # Select next number
    return result

result = np.array([random_subset(n) for n in range(N)])
print(result)

Simply put, whenever I add a number n to the result, I remove from the candidate set a neighborhood of the proper size around n, to ensure that no number at a distance of less than 5 from it can be selected later.

The code is not optimized (multiple set-to-list conversions), but it works (as per my understanding).
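As a rough sketch (the 20 repetitions below are an arbitrary choice), one can verify the minimum-distance constraint and, following the earlier suggestion, look at the occurrence counts aggregated over several repetitions of the whole process:

import collections
import itertools

# Every pair of numbers within a generated list should differ by at least 5.
for row in result:
    assert all(abs(a - b) >= 5 for a, b in itertools.combinations(row, 2))

# Aggregate how often each number appears over repeated runs of the whole process.
counts = collections.Counter()
for _ in range(20):
    for n in range(N):
        counts.update(random_subset(n))
print(min(counts.values()), max(counts.values()))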

Upvotes: 1
