Veltzer Doron

Reputation: 974

Generate random natural numbers that sum to a given number and comply with a set of general constraints

I had an application that required something similar to the problem described here.

I too need to generate a set of positive integer random variables {Xi} that add up to a given sum S, where each variable may have bound constraints such as mi <= Xi <= Mi.

This much I know how to do. The problem is that in my case there may also be constraints between the random variables themselves, say Xi <= Fi(Xj) for some given Fi (let's also say Fi's inverse is known). How should one generate the random variables "correctly"? I put "correctly" in quotes because I'm not really sure what it should mean here, except that I want the generated numbers to cover all possible cases with as uniform a probability as possible for each case.

Consider even a very simple case: 4 random variables X1, X2, X3, X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2. What would be the "correct" way to generate them?

P.S. I know this seems like a better fit for Math Overflow, but I found no solutions there either.

Upvotes: 1

Views: 838

Answers (3)

Veltzer Doron

Reputation: 974

I have an answer to my own question. Under a general set of constraints, what I do is:

  • Sample the constraints in order to estimate s, the size of the constrained area.
  • If s is large enough, generate random samples and reject those that do not comply with the constraints, as described in my previous answer.
  • Otherwise:
    1. Enumerate the entire simplex.
    2. Apply the constraints to filter out all tuples outside the constrained area.
    3. Keep the resulting filtered tuples in a list.
    4. When asked to generate, choose uniformly from this list. (Note: this is worth the effort only because I'm asked to generate very often.)
  • A combination of these two strategies covers most cases.

Note: I also had to handle cases where S was itself a randomly generated parameter (m < S < M), in which case I simply treat it as another random variable constrained between m and M, generate it together with the rest of the variables, and handle it as described earlier.
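
For the simple 4-variable case from the question, the enumerate-filter-choose steps above can be sketched as follows. This is my illustration, not code from the original post; `enumerate_simplex` is a hypothetical helper, and the sum is kept small so full enumeration stays cheap:

```python
import random

S = 20  # small sum: the 4-variable simplex has only a few thousand tuples

def enumerate_simplex(s, k):
    """Yield every k-tuple of non-negative integers summing to s."""
    if k == 1:
        yield (s,)
        return
    for x in range(s + 1):
        for rest in enumerate_simplex(s - x, k - 1):
            yield (x,) + rest

# Filter with the constraint X1 <= 2*X2 from the question.
valid = [t for t in enumerate_simplex(S, 4) if t[0] <= 2 * t[1]]

# Generation is now just a uniform choice from the precomputed list,
# so every surviving tuple is equally likely.
sample = random.choice(valid)
```

Precomputing `valid` pays off exactly as noted above: each subsequent draw is O(1).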

Upvotes: 0

Veltzer Doron

Reputation: 974

OK, so I have this solution for my actual question: I generate 9000 triplets of 3 random variables by prepending a column of zeros and appending a column of ones to sorted arrays of random pairs, then taking the differences along each row, as suggested in the SO answer I mentioned in my original question.

Then I simply filter out the ones that don't match my constraints and plot them.

import numpy as np
import matplotlib.pyplot as plt

S = 100

def generate(n=9000):
    # n uniform points on the 2-simplex: differences of sorted uniforms
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n, 1])])
    return np.diff(uv, axis=1)

a = generate()

def plotter(a):
    fig = plt.figure(figsize=(10, 10), dpi=100)
    ax = fig.add_subplot(projection='3d')

    surf = ax.scatter(*zip(*a), marker='o', color=a / 100)
    ax.view_init(elev=25., azim=75)
    
    ax.set_xlabel('$A_1$', fontsize='large', fontweight='bold')
    ax.set_ylabel('$A_2$', fontsize='large', fontweight='bold')
    ax.set_zlabel('$A_3$', fontsize='large', fontweight='bold')
    lim = (0, S)
    ax.set_xlim3d(*lim)
    ax.set_ylim3d(*lim)
    ax.set_zlim3d(*lim)
    plt.show()

b = a[(a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) &
      (a[:, 1] >= a[:, 2]), :] * S
plotter(b.astype(int))

[3D scatter plot of the filtered triplets]

As you can see, the distribution is uniform over these arbitrary limits on the simplex. I'm still not sure, however, whether I could forgo throwing away samples that don't satisfy the constraints (that is, work the constraints somehow into the generation process; I'm now almost certain this can't be done for general {Fi}). This would be useful in the general case when the constraints limit the sampled area to a very small sub-area a of the entire simplex, since rejection sampling then requires on the order of 1/a draws from the simplex per accepted sample.
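
To decide whether rejection is affordable, the surviving fraction can be estimated empirically. A minimal sketch of this idea (mine, not from the original post), reusing the same simplex construction and the same filter constants as the snippet above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform points on the 2-simplex via differences of sorted uniforms,
# the same construction as generate() above.
n = 100_000
uv = np.hstack([np.zeros([n, 1]),
                np.sort(rng.random((n, 2)), axis=1),
                np.ones([n, 1])])
a = np.diff(uv, axis=1)

# Fraction of samples that survive the same filter used above.
mask = (a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) & (a[:, 1] >= a[:, 2])
acceptance = mask.mean()

# Rejection sampling needs roughly 1/acceptance raw draws per accepted
# sample, which is the quantity to compare against the cost of full
# enumeration.
expected_draws_per_sample = 1.0 / acceptance
```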

If someone has an answer to this last question I will be much obliged (and will change the accepted answer to theirs).

Upvotes: 0

Severin Pappadeux

Reputation: 20080

For 4 random variables X1, X2, X3, X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, one could use a multinomial distribution.

As long as the probability of the first number is low enough, your condition will almost always be satisfied; if not, reject and repeat. And the multinomial distribution by design has a sum equal to 100.

Code, Windows 10 x64, Python 3.8

import numpy as np

def x1x2x3x4(rng):
    # rejection loop: resample until the constraint holds
    while True:
        v = rng.multinomial(100, [0.1, 1/2-0.1, 1/4, 1/4])
        if v[0] <= 2*v[1]:
            return v

rng = np.random.default_rng()

print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))

UPDATE

Lots of freedom in selecting the probabilities. E.g., you could make the other three (#2, 3, 4) symmetric. Code

def x1x2x3x4(rng, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(100, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v

UPDATE II

If you start rejecting combinations, then you artificially bump the probabilities of one subset of events and lower the probabilities of another, while the total sum always stays 1. There is NO WAY to have uniform probabilities with the conditions you want to meet. The code below runs the multinomial with equal probabilities and computes histograms and mean values. Each mean is supposed to be exactly 25 (= 100/4), but as soon as you reject some samples, you lower the mean of the first value and raise the mean of the second. The difference is small, but UNAVOIDABLE. If that is ok with you, so be it. Code

import numpy as np
import matplotlib.pyplot as plt

def x1x2x3x4(rng, summa, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(summa, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
    return None

rng = np.random.default_rng()

s = 100
N = 5000000

# histograms
first = np.zeros(s+1)
secnd = np.zeros(s+1)
third = np.zeros(s+1)
forth = np.zeros(s+1)

mfirst = np.float64(0.0)
msecnd = np.float64(0.0)
mthird = np.float64(0.0)
mforth = np.float64(0.0)

for _ in range(0, N): # sampling with equal probabilities
    v = x1x2x3x4(rng, s, 0.25)

    q = v[0]
    mfirst   += np.float64(q)
    first[q] += 1.0

    q = v[1]
    msecnd   += np.float64(q)
    secnd[q] += 1.0

    q = v[2]
    mthird   += np.float64(q)
    third[q] += 1.0

    q = v[3]
    mforth   += np.float64(q)
    forth[q] += 1.0

x = np.arange(0, s+1, dtype=np.int32)

fig, axs = plt.subplots(4)
axs[0].stem(x, first, markerfmt=' ')
axs[1].stem(x, secnd, markerfmt=' ')
axs[2].stem(x, third, markerfmt=' ')
axs[3].stem(x, forth, markerfmt=' ')
plt.show()

print((mfirst/N, msecnd/N, mthird/N, mforth/N))

prints

(24.9267492, 25.0858356, 24.9928602, 24.994555)

NB! As I said, the first mean is lower and the second is higher. The histograms differ a little bit as well.

[histograms of the four components]

UPDATE III

Ok, Dirichlet, so be it. Let's compute the mean values of your generator before and after the filter. Code

import numpy as np

def generate(n=10000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)

a = generate(1000000)

print("Original Dirichlet sample means")
print(a.shape)
print(np.mean((a[:, 0] * 100).astype(int)))
print(np.mean((a[:, 1] * 100).astype(int)))
print(np.mean((a[:, 2] * 100).astype(int)))

print("\nFiltered Dirichlet sample means")
q = (a[(a[:,0]<=2*a[:,1]) & (a[:,2]>0.35),:] * 100).astype(int)
print(q.shape)

print(np.mean(q[:, 0]))
print(np.mean(q[:, 1]))
print(np.mean(q[:, 2]))

I've got

Original Dirichlet sample means
(1000000, 3)
32.833758
32.791228
32.88054

Filtered Dirichlet sample means
(281428, 3)
13.912784086871243
28.36360987535
56.23109285501087

Do you see the difference? As soon as you apply any kind of filter, you alter the distribution; nothing is uniform anymore.

Upvotes: 2
