Ben

Reputation: 13332

Generating random numbers from arbitrary probability density function

I would like to be able to generate random numbers with a probability density function that comes from a drawn curve. These two below have the same area under the curve but should produce lists of random numbers with different characteristics.

[figure: two drawn curves with equal area under them]

My intuition is that one way to do it would be to sample the curve, use the areas of the resulting rectangles to feed an np.random.choice that picks a rectangle, and then draw an ordinary uniform random number within that rectangle's range.

[figure: the curve sampled into rectangles]

This doesn't feel like a very efficient way to do it. Is there a more 'correct' way to do it?

I had a crack at actually doing it:

import matplotlib.pyplot as plt
import numpy as np

# area under the drawn curve for each slice, and the slice boundaries on [0, 1]
areas = [4.397498, 4.417111, 4.538467, 4.735034, 4.990129, 5.292455, 5.633938,
         6.008574, 6.41175, 5.888393, 2.861898, 2.347887, 2.459234, 2.494357,
         2.502986, 2.511614, 2.520243, 2.528872, 2.537501, 2.546129, 7.223747,
         7.223747, 2.448148, 1.978746, 1.750221, 1.659351, 1.669999]
divisions = [0.0, 0.037037, 0.074074, 0.111111, 0.148148, 0.185185, 0.222222,
             0.259259, 0.296296, 0.333333, 0.37037, 0.407407, 0.444444, 0.481481,
             0.518519, 0.555556, 0.592593, 0.62963, 0.666667, 0.703704, 0.740741,
             0.777778, 0.814815, 0.851852, 0.888889, 0.925926, 0.962963, 1.0]

# pick a slice with probability proportional to its area...
weights = [a / sum(areas) for a in areas]
indexes = np.random.choice(range(len(areas)), 50000, p=weights)

# ...then draw uniformly within that slice
samples = []
for i in indexes:
    samples.append(np.random.uniform(divisions[i], divisions[i + 1]))

binwidth = 0.02
bin_edges = np.arange(min(samples), max(samples) + binwidth, binwidth)
plt.hist(samples, bins=bin_edges)
plt.xlim(xmax=1)
plt.show()

[figure: histogram of the generated samples]

The method seems to work, but is a bit heavy!

Upvotes: 5

Views: 2882

Answers (5)

Christoph Baumgarten

Reputation: 189

Your intuition to sample with weights proportional to the density is fine as an approximation. I recommend using tools that invert the distribution function after integrating the density. Here is an example:

import numpy as np
from scipy.stats.sampling import NumericalInversePolynomial
from matplotlib import pyplot as plt
from scipy.integrate import quad


class MyDist:
    def __init__(self, a):
        self.a = a

    def support(self):
        # distribution restricted to 0, 5, can be changed as needed
        return (0, 5)

    def pdf(self, x):
        # this is not a proper pdf, the normalizing
        # constant is missing (does not integrate to one)
        # this is ok for the method presented here
        return x * (x + np.sin(5*x) + 2) * np.exp(-x**self.a)


dist = MyDist(0.5)
gen = NumericalInversePolynomial(dist)

# compute the missing normalizing constant to plot the pdf
const_pdf = quad(dist.pdf, *dist.support())[0]

r = gen.rvs(size=50000)
x = np.linspace(r.min(), r.max(), 500)

# show histogram together with the pdf
plt.plot(x, dist.pdf(x) / const_pdf)
plt.hist(r, density=True, bins=100)
plt.show()

[figure: histogram of the samples together with the normalized pdf]

scipy.stats.sampling provides more tools to sample from custom continuous or discrete univariate distributions given only some information about the distribution, such as the density/pdf. An overview of the different methods: https://docs.scipy.org/doc/scipy/reference/stats.sampling.html
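
For example, the discrete samplers there map directly onto the rectangle-weights idea from the question. A minimal sketch with DiscreteAliasUrn (the probability vector below is made up for illustration):

import numpy as np
from scipy.stats.sampling import DiscreteAliasUrn

pv = [0.1, 0.2, 0.4, 0.2, 0.1]   # e.g. normalized rectangle areas
gen = DiscreteAliasUrn(pv, random_state=np.random.default_rng())
indexes = gen.rvs(size=50000)    # indices drawn with probabilities pv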

Upvotes: 1

François Landes

Reputation: 758

I was having trouble with rv_continuous, so I wrote my own little routine to sample from any continuous distribution with compact support, e.g. a sum of two exponentials, or any known discrete pdf (as asked in the question). This is essentially @Jan's solution (a pretty classic one).

My code is fully self-contained. To adapt it to any other distribution, you only need to change the formula in unnormalized_pdf and make sure the boundaries of the support are set correctly (in my case, 0 to 10/lambda_max is enough).

import numpy as np
import matplotlib.pyplot as plt

plt.ion()

## The pdf may be any non-negative function, as long as it has FINITE support
def unnormalized_pdf(T, lambda1, intercept1, lambda2, intercept2):
    return np.exp(-lambda1 * T - intercept1) + np.exp(-lambda2 * T - intercept2)


lambda1, intercept1, lambda2, intercept2 = (
    0.0012941708402716523,
    8.435217547457713,
    0.0063804460354380385,
    6.712937938322769,
)

## defining the support of the pdf by hand
x0 = 0
xmax = max(1 / lambda1, 1 / lambda2) * 10

## the more bins, the higher the precision
Nbins = 1000000
xs = np.linspace(x0, xmax, Nbins)
dx = xs[1] - xs[0]
## other way to specify it:
# dx = min(1/lambda1, 1/lambda2)/100
# xs = np.arange(x0, xmax, dx)

## compute the (approximate) pdf and cdf of the thing to sample:
pdf = unnormalized_pdf(xs, lambda1, intercept1, lambda2, intercept2)
normalized_pdf = pdf / pdf.sum()
cdf = np.cumsum(normalized_pdf)

## sampling from the distro
Nsamples = 100000
r = np.random.random(Nsamples)
indices_in_cdf = np.searchsorted(cdf, r)
values_drawn = xs[indices_in_cdf]
histo, bins = np.histogram(values_drawn, 1000, density=True)
plt.semilogy(bins[:-1], histo, label="drawn from distro", color="blue")
plt.semilogy(xs, normalized_pdf / dx, label="exact pdf from which we sample", color="k", lw=3)
plt.legend()
plt.show()

[figure: histogram of the drawn samples against the exact pdf, log scale]

Upvotes: 1

Jan

Reputation: 305

Another method is inverse transform sampling: compute the CDF of your PDF and invert it, then use a uniform random generator to produce p values and evaluate the inverse CDF at those values to obtain random draws from your PDF. See this article: http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution
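
As a minimal illustration, here the distribution is an exponential, chosen only because its inverse CDF has a closed form (it is not the questioner's drawn curve):

import numpy as np

rng = np.random.default_rng()
lam = 2.0                       # rate of the illustrative exponential pdf
p = rng.uniform(size=100_000)   # uniform p values on [0, 1)
draws = -np.log(1 - p) / lam    # inverse CDF of the exponential: F^-1(p) = -ln(1 - p)/lam

For a drawn curve with no closed-form inverse, the same idea works numerically: tabulate the CDF and invert it by interpolation or a search, as the other answers do.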

Upvotes: 2

Bill Bell

Reputation: 21643

One way to do it is to use rv_continuous from scipy.stats. A straightforward way to begin would be to approximate one of those pdfs with a collection of splines and hand it to rv_continuous. In fact, you can generate pseudorandom deviates by defining either a pdf or a cdf with this class.
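
A minimal sketch of that route, assuming the pdf has already been normalized over its support (the wedge-shaped density below is made up for illustration):

import numpy as np
from scipy.stats import rv_continuous

class wedge_gen(rv_continuous):
    """Illustrative distribution with pdf 2*(1 - x) on [0, 1]."""
    def _pdf(self, x):
        return 2 * (1 - x)   # must integrate to 1 over [a, b]

wedge = wedge_gen(a=0, b=1, name="wedge")
samples = wedge.rvs(size=10_000)

With only _pdf defined, rvs falls back to numerical integration and root finding to invert the CDF, which is why it can feel slow for large sample sizes.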

Upvotes: 2

Greg Jennings

Reputation: 1641

For your case, it seems like a histogram-based approach would definitely be easiest, since you have a line that the user has drawn.

But since you're just trying to generate random numbers from that distribution, you can use the normalized y-values (each pixel's y-position divided by the sum over all pixels) as the probability_distribution in the call below, with list_of_candidates being an array the size of the number of pixels the user has drawn (their x-values).

from numpy.random import choice
pde = choice(list_of_candidates, number_of_items_to_pick, p=probability_distribution)

probability_distribution (the normalized pixel y-values) is a sequence in the same order as list_of_candidates (the associated x-values). You can also use the keyword argument replace=False so that drawn items are not replaced.
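
For concreteness, a self-contained sketch of this approach with made-up drawn-curve data (the pixel arrays and the curve shape are hypothetical):

import numpy as np

# hypothetical drawn curve: x positions of the pixels and their y heights
x_pixels = np.linspace(0.0, 1.0, 200)                    # list_of_candidates
y_pixels = np.exp(-(x_pixels - 0.3) ** 2 / 0.02) + 0.5   # heights along the drawn line
probability_distribution = y_pixels / y_pixels.sum()     # normalize to sum to 1

samples = np.random.choice(x_pixels, size=50_000, p=probability_distribution)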

See the numpy documentation for numpy.random.choice: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html

This should be much faster since you're not actually generating an entire pde, just drawing random numbers that match the pde.

EDIT: your update looks like a solid approach. If you do want to generate the pde, you might consider investigating numba (http://numba.pydata.org) to vectorize your for loop.
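
A sketch of what that could look like, assuming numba is installed; the draw_samples function and the filler arrays below are illustrative, not from the original post:

import numpy as np
from numba import njit

@njit
def draw_samples(divisions, indexes):
    # compiled version of the question's loop: one uniform draw per chosen slice
    out = np.empty(indexes.size)
    for k in range(indexes.size):
        i = indexes[k]
        out[k] = np.random.uniform(divisions[i], divisions[i + 1])
    return out

# usage with arrays shaped like the question's (values made up here)
divisions = np.linspace(0.0, 1.0, 28)
weights = np.full(27, 1 / 27)
indexes = np.random.choice(27, 50000, p=weights)
samples = draw_samples(divisions, indexes)

With numpy alone, np.random.uniform(divisions[indexes], divisions[indexes + 1]) broadcasts over the bounds and also avoids the Python-level loop, without needing numba at all.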

Upvotes: 2
