Reputation: 399
I have a list of values of a random variable X and the fractions with which they occur; if I plot these I get a probability density function. How can I use this probability density function to generate random numbers?
I've used scipy.interpolate.CubicSpline to obtain a Python function for this data. How do I use this function to generate random numbers?
Upvotes: 2
Views: 2730
Reputation: 21
To rephrase your question: you have a pdf ("a list containing random variables X and the fractions they occur"; strictly a probability mass function, since the values are discrete) and want to know how to draw random samples from a distribution that has this pdf. There are two ways (that I know of) to do this, depending on how formal you want to be.
TLDR: For simple cases use the NumPy implementation as it's clean, simple and fast. If you want a more formal version because you're using a larger statistical framework then maybe the SciPy version fits better.
SciPy
If you want it to fit into the SciPy distribution framework, you can use the rv_discrete class and pass it your values and probabilities. In your case this would look like:
from scipy.stats import rv_discrete
# these are your variables X
vals = [1, 2, 3]
# these are the fractions they occur
probs = [0.2, 0.5, 0.3]
# define discrete distribution
distrib = rv_discrete(values=(range(len(vals)), probs))
# sample 10 values from this distribution
distrib.rvs(size=10)
array([1, 0, 1, 2, 1, 1, 0, 1, 1, 1])
# distrib returns indices into vals, not the values themselves, so map them back
[vals[x] for x in distrib.rvs(size=10)]
[3, 2, 3, 2, 2, 2, 1, 1, 2, 2]
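The index-to-value mapping can also be done with NumPy fancy indexing instead of a list comprehension, which avoids a Python-level loop. A minimal sketch, assuming the values fit in a NumPy array (vals_arr, idx and samples are names introduced here for illustration):

```python
import numpy as np
from scipy.stats import rv_discrete

vals = [1, 2, 3]
probs = [0.2, 0.5, 0.3]

# the distribution is defined over the indices 0..len(vals)-1
distrib = rv_discrete(values=(np.arange(len(vals)), probs))

# convert once so the sampled indices can be looked up in a single step
vals_arr = np.asarray(vals)
idx = distrib.rvs(size=10)
samples = vals_arr[idx]
print(samples)
```

The result is a NumPy array rather than a list, which is usually what later numerical code wants anyway.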
And a quick speed test for good measure:
%timeit [vals[x] for x in distrib.rvs(size=10000)]
2.34 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
NumPy
As mentioned in the comments, you can use NumPy directly via the np.random.choice function.
import numpy as np
np.random.choice(vals, size=10, p=probs)
array([2, 2, 1, 2, 2, 2, 2, 3, 1, 2])
Although it's not part of the SciPy distribution framework, it is simple and clean and, as the timing below shows, faster:
%timeit np.random.choice(vals, size=10000, p=probs)
639 µs ± 204 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
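Part of the speed difference can come from how the pseudo-random numbers that underlie the sampling process are generated. NumPy's newer Generator API (np.random.default_rng) defaults to the PCG family of random number generators, whereas SciPy by default still draws from the legacy Mersenne Twister based np.random.RandomState. Note that the legacy np.random.choice call used above also runs on Mersenne Twister, so in that first comparison the gap likely comes mostly from rv_discrete's overhead and the Python-level list comprehension rather than from the PRNG. They announced this here and if you are curious about how it works I've written a simple explainer here. There's also a lot more detail here between the experts.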
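To sample with the PCG-based generator directly in NumPy, one option is the Generator object returned by np.random.default_rng, whose choice method accepts the same p argument. A minimal sketch (the seed 123 is arbitrary and only there for reproducibility):

```python
import numpy as np

vals = [1, 2, 3]
probs = [0.2, 0.5, 0.3]

# default_rng returns a Generator backed by the PCG64 bit generator
rng = np.random.default_rng(123)

# same call shape as np.random.choice, but on the Generator instance
samples = rng.choice(vals, size=10, p=probs)
print(type(rng.bit_generator).__name__, samples)
```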
We can see the impact of this speed improvement by passing the NumPy PRNG to SciPy:
# default SciPy
distrib = rv_discrete(values=(range(len(vals)), probs))
%timeit [vals[x] for x in distrib.rvs(size=1000000)]
358 ms ± 204 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# SciPy but we pass NumPy's new PCG PRNG
np_seed = np.random.default_rng(123)
distrib = rv_discrete(values=(range(len(vals)), probs), seed=np_seed)
%timeit [vals[x] for x in distrib.rvs(size=1000000)]
221 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
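So using NumPy's PRNG is about 1.6x faster here (358 ms vs 221 ms), although the large spread on the default run (± 204 ms) makes this only a rough estimate.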
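The %timeit magic is only available in IPython/Jupyter. Outside of it, a rough equivalent can be put together with the standard library's timeit module. This is a sketch rather than a rigorous benchmark: time_sampling is a hypothetical helper, the sample size and repeat count are arbitrary, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
from scipy.stats import rv_discrete

vals = np.array([1, 2, 3])
probs = [0.2, 0.5, 0.3]
idx = np.arange(len(vals))

# default SciPy random state vs. an explicitly passed NumPy Generator
default_distrib = rv_discrete(values=(idx, probs))
pcg_distrib = rv_discrete(values=(idx, probs), seed=np.random.default_rng(123))


def time_sampling(distrib, size=100_000, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    runs = timeit.repeat(lambda: vals[distrib.rvs(size=size)],
                         number=1, repeat=repeats)
    return min(runs)


t_default = time_sampling(default_distrib)
t_pcg = time_sampling(pcg_distrib)
print(f"default: {t_default:.4f} s, PCG64: {t_pcg:.4f} s")
```

Taking the minimum over several runs is one common way to reduce noise from other processes; reporting mean and spread, as %timeit does, is an equally reasonable choice.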
Upvotes: 2