Reputation: 13332
I would like to be able to generate random numbers with a probability density function that comes from a drawn curve. These two below have the same area under the curve but should produce lists of random numbers with different characteristics.
My intuition is that one way to do it would be to sample the curve, use the areas of the resulting rectangles as weights for an np.random.choice
to pick a rectangle, and then draw an ordinary uniform random number within that rectangle's range.
This doesn't feel like a very efficient way to do it. Is there a more 'correct' way?
I had a crack at actually doing it:
import matplotlib.pyplot as plt
import numpy as np
areas = [4.397498, 4.417111, 4.538467, 4.735034, 4.990129, 5.292455, 5.633938,
6.008574, 6.41175, 5.888393, 2.861898, 2.347887, 2.459234, 2.494357,
2.502986, 2.511614, 2.520243, 2.528872, 2.537501, 2.546129, 7.223747,
7.223747, 2.448148, 1.978746, 1.750221, 1.659351, 1.669999]
divisions = [0.0, 0.037037, 0.074074, 0.111111, 0.148148, 0.185185, 0.222222,
0.259259, 0.296296, 0.333333, 0.37037, 0.407407, 0.444444, 0.481481,
0.518519, 0.555556, 0.592593, 0.62963, 0.666667, 0.703704, 0.740741,
0.777778, 0.814815, 0.851852, 0.888889, 0.925926, 0.962963, 1.0]
weights = [a / sum(areas) for a in areas]
indexes = np.random.choice(range(len(areas)), 50000, p=weights)
samples = []
for i in indexes:
    samples.append(np.random.uniform(divisions[i], divisions[i + 1]))
binwidth = 0.02
binSize = np.arange(min(samples), max(samples) + binwidth, binwidth)
plt.hist(samples, bins=binSize)
plt.xlim(right=1)
plt.show()
The method seems to work, but is a bit heavy!
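For what it's worth, the per-sample Python loop can be vectorized away entirely, since np.random.Generator.uniform broadcasts over array-valued bounds. A sketch with made-up bin areas (the `areas`/`edges` values here are placeholders, not the ones above):

```python
import numpy as np

# example bin areas and bin edges (any piecewise-constant density works)
areas = np.array([4.0, 6.0, 2.0, 8.0])
edges = np.linspace(0.0, 1.0, len(areas) + 1)

weights = areas / areas.sum()
rng = np.random.default_rng(0)

# pick a bin for every sample at once, then place each sample
# uniformly inside its own bin -- no Python loop required
idx = rng.choice(len(areas), size=50_000, p=weights)
samples = rng.uniform(edges[idx], edges[idx + 1])
```

This does the same piecewise-uniform sampling as the loop above, but the two array operations replace 50,000 iterations.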
Upvotes: 5
Views: 2882
Reputation: 189
Your intuition of sampling with weights proportional to the density is fine as an approximation. However, I recommend using SciPy's tools that numerically invert the distribution function obtained by integrating the density. Here is an example:
import numpy as np
from scipy.stats.sampling import NumericalInversePolynomial
from matplotlib import pyplot as plt
from scipy.integrate import quad
class MyDist:
    def __init__(self, a):
        self.a = a

    def support(self):
        # the distribution is restricted to (0, 5); change as needed
        return (0, 5)

    def pdf(self, x):
        # this is not a proper pdf: the normalizing constant is
        # missing (it does not integrate to one), which is fine
        # for the method presented here
        return x * (x + np.sin(5*x) + 2) * np.exp(-x**self.a)
dist = MyDist(0.5)
gen = NumericalInversePolynomial(dist)
# compute the missing normalizing constant to plot the pdf
const_pdf = quad(dist.pdf, *dist.support())[0]
r = gen.rvs(size=50000)
x = np.linspace(r.min(), r.max(), 500)
# show histogram together with the pdf
plt.plot(x, dist.pdf(x) / const_pdf)
plt.hist(r, density=True, bins=100)
plt.show()
There are more tools in scipy.stats.sampling for sampling from custom continuous or discrete univariate distributions given only partial information about the distribution, such as the density/pdf. An overview of the different methods: https://docs.scipy.org/doc/scipy/reference/stats.sampling.html
Upvotes: 1
Reputation: 758
I was having trouble with rv_continuous
, so I made my own little routine to sample from any continuous distribution with compact support, e.g. from a sum of two exponentials, or from any known discrete pdf (as asked in the question).
This is essentially @Jan's solution (a pretty classic one).
My code is fully self-contained. To adapt it to any other distribution, you only need to change the formula in unnormalized_pdf and make sure the boundaries of your support are correctly set (in my case, 0 to 10/lambda_max is enough).
import numpy as np
import matplotlib.pyplot as plt
plt.ion()
## The pdf may be any function, as long as it has FINITE support
def unnormalized_pdf(T, lambda1, intercept1, lambda2, intercept2):
    return np.exp(-lambda1 * T - intercept1) + np.exp(-lambda2 * T - intercept2)

lambda1, intercept1, lambda2, intercept2 = (
    0.0012941708402716523,
    8.435217547457713,
    0.0063804460354380385,
    6.712937938322769,
)
## defining the support of the pdf by hand
x0 = 0
xmax = max(1 / lambda1, 1 / lambda2) * 10
## the more bins, the higher the precision
Nbins = 1000000
xs = np.linspace(x0, xmax, Nbins)
dx = xs[1] - xs[0]
## other way to specify it:
# dx = min(1/lambda1, 1/lambda2)/100
# xs = np.arange(x0, xmax, dx)
## compute the (approximate) pdf and cdf of the thing to sample:
pdf = unnormalized_pdf(xs, lambda1, intercept1, lambda2, intercept2)
normalized_pdf = pdf / pdf.sum()
cdf = np.cumsum(normalized_pdf)
## sampling from the distro
Nsamples = 100000
r = np.random.random(Nsamples)
indices_in_cdf = np.searchsorted(cdf, r)
values_drawn = xs[indices_in_cdf]
histo, bins = np.histogram(values_drawn, 1000, density=True)
plt.semilogy(bins[:-1], histo, label="drawn from distro", color="blue")
plt.semilogy(xs, normalized_pdf / dx, label="exact pdf from which we sample", color="k", lw=3)
plt.legend()
plt.show()
Upvotes: 1
Reputation: 305
Another method is to sample the inverse of the CDF. Then you use a uniform random generator to generate p values on the x-axis of the inverse CDF to generate random draws of your PDF. See this article: http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution
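A minimal sketch of that inverse-transform approach, assuming the pdf is only known as values tabulated on a grid (the example density here is arbitrary; np.interp does the numerical inversion of the CDF):

```python
import numpy as np

# tabulate an (unnormalized) pdf on a grid; this density is just
# a placeholder for whatever curve you actually have
x = np.linspace(0.0, 5.0, 2001)
pdf = np.exp(-x) * (1 + np.sin(3 * x) ** 2)

# build the CDF by cumulative summation and normalize it to [0, 1]
cdf = np.cumsum(pdf)
cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])

# inverse-transform sampling: uniform p values mapped through the
# inverse CDF (np.interp interpolates x as a function of the CDF)
rng = np.random.default_rng(1)
p = rng.random(10_000)
samples = np.interp(p, cdf, x)
```

Because the pdf is strictly positive on the grid, the CDF is strictly increasing, which is what makes the interpolation a valid inverse.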
Upvotes: 2
Reputation: 21643
One way to do it is to use rv_continuous from scipy.stats. A straightforward way to begin would be to approximate one of those pdfs with a collection of splines and feed it to rv_continuous. In fact, you can generate pseudorandom deviates by defining either a pdf or a cdf with this thing.
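A minimal sketch of subclassing rv_continuous, using a toy density f(x) = 2x on [0, 1] instead of a spline fit (rv_continuous expects a properly normalized pdf; supplying the closed-form _cdf as well avoids slow numerical integration):

```python
import numpy as np
from scipy import stats

class RampDist(stats.rv_continuous):
    """Toy distribution with pdf f(x) = 2x on [0, 1]."""

    def _pdf(self, x):
        return 2.0 * x

    def _cdf(self, x):
        # closed-form CDF makes rvs() much faster than
        # letting scipy integrate _pdf numerically
        return x ** 2

# a and b set the support of the distribution
dist = RampDist(a=0.0, b=1.0, name="ramp")
samples = dist.rvs(size=1000, random_state=42)
```

For a drawn curve you would replace `_pdf` (and, if available, `_cdf`) with your spline evaluation, normalized so it integrates to one over [a, b].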
Upvotes: 2
Reputation: 1641
For your case, a histogram-based approach seems easiest, since you have a line that the user has drawn.
And since you're just trying to generate random numbers from that distribution, you can use the normalized y-values (sum the y-positions of all pixels and divide by the total) as the probability_distribution directly in the function below, and draw arrays the size of the number of pixels the user has drawn.
from numpy.random import choice
pde = choice(list_of_candidates, number_of_items_to_pick, p=probability_distribution)
probability_distribution (the normalized pixel y-values) is a sequence in the same order of list_of_candidates (the associated x-values). You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.
This should be much faster since you're not actually generating an entire pde, just drawing random numbers that match the pde.
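A concrete sketch of that idea, with a hypothetical five-pixel curve standing in for the user's drawing (the `y_pixels` values are made up):

```python
import numpy as np

# hypothetical drawn curve: y-value of the curve at each pixel column
y_pixels = np.array([1.0, 3.0, 5.0, 4.0, 2.0])
x_pixels = np.arange(len(y_pixels))  # the associated x-values

# normalize the y-values so they sum to one
probability_distribution = y_pixels / y_pixels.sum()

# draw x-values with probability proportional to the curve height
rng = np.random.default_rng(2)
pde = rng.choice(x_pixels, size=10_000, p=probability_distribution)
```

Each drawn value is one of the pixel x-positions, appearing in proportion to the curve's height there.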
EDIT: your update looks like a solid approach. If you do want to generate the pde, you might consider investigating numba (http://numba.pydata.org) to vectorize your for loop.
Upvotes: 2