ranky123

Reputation: 399

How to generate random numbers according to a custom probability density function (Python)?

I have a list containing the values of a random variable X and the fractions with which they occur, so if I plot these I get a probability density function. How can I use this probability density function to generate random numbers?

I've used scipy.interpolate.CubicSpline to obtain a Python function for this data. How do I use this function to generate random numbers?

Upvotes: 2

Views: 2730

Answers (1)

mjam03

Reputation: 21

To rephrase your question: you have a pdf ("a list containing random variables X and the fractions they occur") and want to know how to draw random samples from a distribution that has this pdf. Because the values are listed with their probabilities, this is a discrete distribution. There are two ways (that I know of) to do this, depending on how formal you want to be.

TLDR: For simple cases use the NumPy implementation, as it's clean, simple and fast. If you want a more formal version because you're working within a larger statistical framework, the SciPy version may fit better.

SciPy

If you want it to fit into the SciPy distribution framework, you can create an instance of the rv_discrete class. In your case this would look like:

from scipy.stats import rv_discrete

# these are your variables X
vals = [1, 2, 3]
# these are the fractions they occur
probs = [0.2, 0.5, 0.3]

# define discrete distribution
distrib = rv_discrete(values=(range(len(vals)), probs)) 

# sample 10 values from this distribution
distrib.rvs(size=10)
array([1, 0, 1, 2, 1, 1, 0, 1, 1, 1])

# distrib outputs indices in vals, not actual vals
[vals[x] for x in distrib.rvs(size=10)]
[3, 2, 3, 2, 2, 2, 1, 1, 2, 2]
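The index-to-value lookup above can also be done without a Python loop. A minimal sketch, assuming vals is stored as a NumPy array (the names idx and samples, and the random_state value, are illustrative and not part of the original answer):

```python
import numpy as np
from scipy.stats import rv_discrete

# same example data as above, with vals as an array so it can be indexed in bulk
vals = np.array([1, 2, 3])
probs = [0.2, 0.5, 0.3]

# the distribution is defined over the indices 0..len(vals)-1
distrib = rv_discrete(values=(np.arange(len(vals)), probs))

# draw indices, then map them to the actual values with fancy indexing
idx = distrib.rvs(size=10, random_state=0)  # random_state fixed only for repeatability
samples = vals[idx]
```

This avoids the list comprehension, which matters mostly for large sample sizes.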

And a quick speed test for good measure:

%timeit [vals[x] for x in distrib.rvs(size=10000)]
2.34 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy

As mentioned in the comments, you can also use NumPy directly via the np.random.choice function.

import numpy as np
np.random.choice(vals, size=10, p=probs)
array([2, 2, 1, 2, 2, 2, 2, 3, 1, 2])

Although it's not part of the SciPy distribution framework, it is simple and clean and, as the timing below shows, faster:

%timeit np.random.choice(vals, size=10000, p=probs)
639 µs ± 204 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

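Part of the speed difference can come from the pseudo-random number generator (PRNG) that underlies the sampling process. NumPy's newer Generator API (np.random.default_rng) defaults to the PCG family of random number generators, whereas SciPy distributions by default still draw from NumPy's legacy Mersenne Twister generator. Note that the legacy np.random.choice function used above also runs on Mersenne Twister, so the gap measured above likely owes more to rv_discrete overhead and the Python-level list comprehension than to the PRNG. NumPy announced the change here, and if you are curious about how it works I've written a simple explainer here. There's also a lot more detail here in a discussion between the experts.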
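The PCG-based generator can also be used for direct sampling, without SciPy. A minimal sketch, assuming the same vals and probs as above (the seed 123 and the name rng are arbitrary choices for illustration):

```python
import numpy as np

vals = [1, 2, 3]
probs = [0.2, 0.5, 0.3]

# default_rng returns a Generator, which is backed by PCG64 by default
rng = np.random.default_rng(123)

# Generator.choice takes the same p= argument as np.random.choice
samples = rng.choice(vals, size=10, p=probs)
```

Keeping an explicit rng object around, rather than relying on the global np.random state, also makes results easier to reproduce.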

We can see the impact of this speed improvement by passing the NumPy PRNG to SciPy:

# default SciPy
distrib = rv_discrete(values=(range(len(vals)), probs))
%timeit [vals[x] for x in distrib.rvs(size=1000000)]
358 ms ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# SciPy but we pass NumPy's new PCG PRNG
# default_rng returns a Generator object (not a plain seed), which rv_discrete accepts
rng = np.random.default_rng(123)
distrib = rv_discrete(values=(range(len(vals)), probs), seed=rng)
%timeit [vals[x] for x in distrib.rvs(size=1000000)]
221 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

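So using NumPy's PRNG is about 1.6x faster (358 ms vs. 221 ms), although the large spread on the default timing (± 204 ms) makes this a rough estimate.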
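Both approaches above treat the data as discrete. If you instead want samples from a continuous density (for example the one you fitted with CubicSpline), one common technique is inverse transform sampling: integrate the density into a CDF, then map uniform random numbers through the inverse CDF. A rough sketch, assuming the density has been evaluated on a grid; the Gaussian-shaped pdf below is only a stand-in for your spline evaluated at x, and all names are illustrative:

```python
import numpy as np

# hypothetical tabulated density: replace pdf with your spline evaluated on x
# (clip any negative values a spline might produce, e.g. np.clip(pdf, 0, None))
x = np.linspace(-4.0, 4.0, 401)
pdf = np.exp(-0.5 * x**2)

# cumulative trapezoidal integral of the density, starting at 0
cdf = np.concatenate(([0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))))
cdf /= cdf[-1]  # normalise so the CDF ends at exactly 1

# draw uniforms in [0, 1) and invert the CDF by linear interpolation
rng = np.random.default_rng(0)
u = rng.random(10_000)
samples = np.interp(u, cdf, x)
```

The accuracy depends on how fine the grid is, and the linear interpolation assumes the CDF is strictly increasing, which holds when the density is positive everywhere on the grid.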

Upvotes: 2
