Reputation: 4792
I am trying to generate a random sample of multiple variables that are loosely related to each other, meaning that the "allowed" values of some variables depend on the values assigned to other variables.
For simplicity, let's imagine that I have just two variables, A and B, and that both follow a uniform or Gaussian distribution (we don't really care which exact distribution they follow and can accept either). For this discussion, let's assume both are uniform.
Let's say variable A can take any value between 0 and 100. We can easily sample, say, 1000 data points from this distribution.
Now we also want to generate values for variable B, which can take any value between, say, 50 and 150. The catch is that there is a constraint on the resulting sample: the sum of A and B must be between 60 and 160.
The final catch is that the precise sampling boundaries change each time we run the process (for example, one day A can be between 0 and 100 as above, the next day it needs to be between -10 and 75, etc.). Basically, the boundaries evolve from day to day.
Right now we do it in a very inefficient way: we generate a completely random grid of A and B values independently, then eliminate all of the (A, B) combinations that don't satisfy the constraints we specify, and then use the survivors in subsequent steps.
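In code, the current approach is roughly the following (a minimal sketch using the example bounds; variable names and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

# Draw A and B independently over their full ranges
# (bounds from the example; in practice they change from run to run).
n = 1_000_000
a = rng.uniform(0, 100, size=n)
b = rng.uniform(50, 150, size=n)

# Keep only the pairs that satisfy the joint constraint.
mask = (a + b > 60) & (a + b < 160)
grid = np.column_stack((a[mask], b[mask]))
```

The number of rows in `grid` is random, which is exactly the problem described below: it depends on how many draws happen to survive the filter.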
However, as you can guess, this is super inefficient. In reality we have many variables (30+) and a large set of constraints to apply. Completely random grid generation leads to cases where, after applying all the constraints, we end up with no points at all unless the sample size is large enough; to ensure we always get at least some points, we need to generate a grid with millions of points. Beyond that, each re-run of the sampling procedure yields a dataset of a different size: sometimes all points are eliminated, sometimes we get 10 points, and sometimes 1000.
So my question is: is there a way to do this more efficiently, in a "statistically correct" way, ideally one that lets us specify how many sample points satisfying all constraints we want to end up with? Any guidance or pointers to code examples would be much appreciated.
Upvotes: 0
Views: 463
Reputation: 1002
I'm not sure there is an entirely different approach to what you are doing (which is essentially rejection sampling). But you can definitely do it more efficiently than you describe, e.g. by not generating lots of combinations up front only to reject most of them afterwards.
Maybe this could help:
Define the boundaries of your variables and a function that evaluates the constraints you put on them. Here I am using the values from your example; more variables and constraints can be added easily.
minima = [0, 50]
maxima = [100, 150]

def constraints(a, b):
    # inputs are arrays of random numbers, one per variable
    # returns a boolean mask for indexing
    return ((a + b) > 60) & ((a + b) < 160)
Then you can generate batches of random numbers and check in a vectorised way whether the constraints are fulfilled. Depending on the dimensionality and complexity of your constraints this may still reject plenty of values, but at least you don't store them all in advance, and you can specify the desired number of samples.
import numpy as np
import scipy.stats

def sample_numbers(constraints, num_samples, minima, maxima, batch=10000):
    samples = np.zeros(shape=(num_samples + batch, len(minima)), dtype='int64')
    n_accept = 0
    while n_accept < num_samples:
        # sample from discrete uniform distributions (high is exclusive)
        a = scipy.stats.randint.rvs(low=minima[0], high=maxima[0], size=batch)
        b = scipy.stats.randint.rvs(low=minima[1], high=maxima[1], size=batch)
        # vectorised check of where the constraints are fulfilled
        evaluate_constraints = constraints(a, b)
        # running total of accepted combinations after this batch
        n_accept_update = n_accept + evaluate_constraints.sum()
        # transfer the accepted combinations
        samples[n_accept:n_accept_update] = np.stack((a[evaluate_constraints], b[evaluate_constraints])).T
        n_accept = n_accept_update
    return samples[:num_samples]
sampled_numbers = sample_numbers(constraints=constraints, num_samples=100000, minima=minima, maxima=maxima, batch=1000)
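To size the batch sensibly, you can also estimate the acceptance rate empirically from one throwaway probe batch. This is a sketch under the example bounds (0–100 for A, 50–150 for B); the probe size and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# One probe batch over the example bounds.
probe = 100_000
a = rng.integers(0, 100, size=probe)
b = rng.integers(50, 150, size=probe)
accept_rate = ((a + b > 60) & (a + b < 160)).mean()

# Rough number of batches needed to collect num_samples accepted points.
num_samples, batch = 100_000, 10_000
est_batches = int(np.ceil(num_samples / (accept_rate * batch)))
```

For this particular constraint the acceptance rate is around 60%, so rejection sampling is cheap here; with many variables and tight constraints the rate can become tiny, and this estimate tells you that before you commit to a huge grid.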
Upvotes: 1