Simd
Simd

Reputation: 21343

How to sample from a distribution given the CDF in Python

I would like to draw samples from a probability distribution with CDF 1 - e^(-x^2).

Is there a method in python/scipy/etc. to enable you to sample from a probability distribution given only its CDF?

Upvotes: 4

Views: 7676

Answers (4)

stp
stp

Reputation: 21

Using interpolation when the analytical form might not be known,

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 3, 512)
cdf = 1 - np.exp(-x**2)
cdf_y_samples = np.random.uniform(cdf[0], cdf[-1], size=int(1e5))
cdf_x_samples = np.interp(cdf_y_samples, cdf, x) # apply inverse CDF

plt.figure(1); plt.clf()
plt.hist(cdf_x_samples, bins=50, density=True)

Sample distribution

Upvotes: 0

therealvirtuoso
therealvirtuoso

Reputation: 75

My answer builds off of the answer from @Heike and @Jeff N 's answers (which were both correct, don't get me wrong). However, I had to use a mix of defining both the rv_continuous._ppf, and rv_continuous._pdf, to handle larger sample values, and this solution is quite fast also.

For my case, I had to simulate random variables for over 50,000 values. When only rv_continuous._pdf was defined, the output took very long, and gave a ValueError, encountering a NaN value, and stopping the runner.

When only rv_continuous._ppf was defined, the line graph of the function had issues plotting, as it surpassed the recursion limit in Python (although it is possible to set a higher recursion limit, it did not fix that issue for me).

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

class dist_gen(stats.rv_continuous):
  def __init__(self, seed=233423):
        super().__init__(seed=seed)
  
  # Both _pdf and _ppf overriden in scipy.stats.rv_continuous
  def _pdf(self, x):
    return 0.5*np.exp(-1*(abs(x)))
  
  def _ppf(self, x):
    return -1*np.sign(x-0.5)*np.log(1-(2*abs(x-0.5)))

if __name__ == "__main__":
  # Number of variables to be simulated
  N = 100000
  dist = dist_gen()

  samples = dist.rvs(size=N)

  # Plot histogram
  fig, ax = plt.subplots()
  ax.hist(samples, bins=200, density=True)

  # Plotting the line function
  domain = np.linspace(-7.5, 7.5)
  ax.set_ylim(0, 0.75)
  ax.plot(domain, dist.pdf(domain), color='purple')

  # Showing the final plot
  plt.show()

Distribution Plot

Upvotes: 1

Jeff N
Jeff N

Reputation: 65

Inverse Transform Sampling

To add on to the solution by Heike, you could use Inverse Transform Sampling to sample via the CDF:

import math, random
import matplotlib.pyplot as plt

def inverse_cdf(y):
    # Computed analytically
    return math.sqrt(math.log(-1/(y - 1)))

def sample_distribution():
    uniform_random_sample = random.random()
    return inverse_cdf(uniform_random_sample)

x = [sample_distribution() for i in range(10000)]
plt.hist(x, bins=50)
plt.show()

How SciPy Does It

I was very curious to see how this worked in SciPy, too. It actually looks like it does something very similar to the above. Based on the SciPy docs:

The default method _rvs relies on the inverse of the cdf, _ppf, applied to a uniform random variate. In order to generate random variates efficiently, either the default _ppf needs to be overwritten (e.g. if the inverse cdf can expressed in an explicit form) or a sampling method needs to be implemented in a custom _rvs method.

And based on the SciPy source code, the _ppf (i.e., the inverse of the CDF) does in fact look to be approximated numerically if not specified explicitly. Very cool!

Upvotes: 5

Heike
Heike

Reputation: 24430

To create a custom random variable class given a CDF you could subclass scipy.rv_continuous and override rv_continuous._cdf. This will then automatically generate the corresponding PDF and other statistical information about your distribution, e.g.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

class MyRandomVariableClass(stats.rv_continuous):
    def __init__(self, xtol=1e-14, seed=None):
        super().__init__(a=0, xtol=xtol, seed=seed)

    def _cdf(self, x):
        return 1-np.exp(-x**2)


if __name__ == "__main__":
    my_rv = MyRandomVariableClass()

    # sample distribution
    samples = my_rv.rvs(size = 1000)

    # plot histogram of samples
    fig, ax1 = plt.subplots()
    ax1.hist(list(samples), bins=50)

    # plot PDF and CDF of distribution
    pts = np.linspace(0, 5)
    ax2 = ax1.twinx()
    ax2.set_ylim(0,1.1)
    ax2.plot(pts, my_rv.pdf(pts), color='red')
    ax2.plot(pts, my_rv.cdf(pts), color='orange')

    fig.tight_layout()
    plt.show()

Upvotes: 9

Related Questions