Reputation: 43
I have a problem with optimization of the rejection method of generating continuous random variables. I've got a density: f(x) = 3/2 (1-x^2)
. Here's my code:
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import scipy.stats as ss
a=0 # xmin
b=1 # xmax
m=3/2 # ymax
variables = [] #list for variables
def f(x):
return 3/2 * (1 - x**2) #probability density function
reject = 0 # number of rejections
start = time.time()
while len(variables) < 100000: #I want to generate 100 000 variables
u1 = random.uniform(a,b)
u2 = random.uniform(0,m)
if u2 <= f(u1):
variables.append(u1)
else:
reject +=1
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()
ss.probplot(variables, plot=plt)
plt.show()
My first question: Is my probability plot made properly? And the second, what is in the title. How to optimize that method? I would like to get some advice to optimize the code. Now that code takes about 0.5 seconds and there are about 50 000 rejections. Is it possible to reduce the time and number of rejections? If it's needed I can optimize using a different method of generating variables.
Upvotes: 4
Views: 2916
Reputation: 19855
Others have spoken to the probability plotting, I'm going to address the efficiency of the rejection algorithm.
Acceptance/rejection schemes are based on m(x), a "majorizing function". A majorizing function should have two properties: 1) m(x)≥ f(x) ∀ x; and 2) m(x), when scaled to be a distribution, should be easy to generate values from. You went with the constant function m = 3/2, which meets both requirements but does not bound f(x) very closely. Integrated from zero to one, that has an area of 3/2. Your f(x), being a valid density function, has an area of 1. Consequently, ∫f(x)) / ∫m(x)) = 1 / (3/2) = 2/3. In other words, 2/3 of the values you generate from the majorizing function are accepted, and you are rejecting 1/3 of the attempts.
You need an m(x) which provides a tighter bound for f(x). I went with a line which is tangent to f(x) at x = 1/2. With a little bit of calculus to get the slope, I derived m(x) = 15/8 - 3x/2
.
This choice of m(x) has an area of 9/8, so only 1/9 of the values will be rejected. A bit more calculus yielded the inverse transform generator for x's based on this m(x) is x = (5 - sqrt(25 - 24U)) / 4
, where U
is a uniform(0,1) random varible.
Here's an implementation, based off your original version. I wrapped the rejection scheme in a function, and created the values with a list comprehension rather than appending to a list. As you'll see if you run this, it produces a lot fewer rejections than your original version.
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import math
import scipy.stats as ss
a = 0 # xmin
b = 1 # xmax
reject = 0 # number of rejections
def f(x):
return 3.0 / 2.0 * (1.0 - x**2) #probability density function
def m(x):
return 1.875 - 1.5 * x
def generate_x():
global reject
while True:
x = (5.0 - math.sqrt(25.0 - random.uniform(0.0, 24.0))) / 4.0
u = random.uniform(0, m(x))
if u <= f(x):
return x
reject += 1
start = time.time()
variables = [generate_x() for _ in range(100000)]
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()
Upvotes: 2
Reputation: 111
Regarding your first question, scipy.stats.probplot compares your sample against the quantiles of the normal distribution. If you'd like it to compare against the quantiles of your f(x)
distribution, check out the dist
parameter of probplot
.
In terms of making this sampling procedure faster, avoiding loops is generally the way to go. Replacing the code between start = ...
and end = ...
with the following resulted in a >20x speedup for me.
n_before_accept_reject = 150000
u1 = np.random.uniform(a, b, size=n_before_accept_reject)
u2 = np.random.uniform(0, m, size=n_before_accept_reject)
variables = u1[u2 <= f(u1)]
reject = n_before_accept_reject - len(variables)
Note that this will give you approximately 100000 accepted samples each time you run it. You could raise the value of n_before_accept_reject
slightly to effectively guarantee that variables
will always have >100000 accepted values, and then just cap the size of variables to return exactly 100000 if necessary.
Upvotes: 2
Reputation: 20080
My first question: Is my probability plot made properly?
No. It is made versus default normal distribution. You have to pack your function f(x)
into class derived from stats.rv_continuous, make it into _pdf method, and pass it to probplot
And the second, what is in the title. How to optimise that method? Is it possible to reduce the time and number of rejections?
Sure, you have the power of NumPy vector abilities at your hands. Don't ever write explicit loops - vectoriz, vectorize and vectorize!
Look at modified code below, not a single loop, everything is done via NumPy vectors. Time went down on my computer for 100000 samples (Xeon, Win10 x64, Anaconda Python 3.7) from 0.19 to 0.003.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import time
a = 0. # xmin
b = 1. # xmax
m = 3.0/2.0 # ymax
def f(x):
return 1.5 * (1.0 - x*x) # probability density function
start = time.time()
N = 100000
u1 = np.random.uniform(a, b, N)
u2 = np.random.uniform(0.0, m, N)
negs = np.empty(N)
negs.fill(-1)
variables = np.where(u2 <= f(u1), u1, negs) # accepted samples are positive or 0, rejected are -1
end = time.time()
accept = np.extract(variables>=0.0, variables)
reject = N - len(accept)
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a, b, 1000)
plt.hist(accept, 50, density=True)
plt.plot(x, f(x))
plt.show()
ss.probplot(accept, plot=plt) # against normal distribution
plt.show()
Concerning reducing number of rejections, you could sample with 0 rejects doing inverse method, it is cubic equation so it could work with easy
UPDATE
Here is the code to use for probplot:
class my_pdf(ss.rv_continuous):
def _pdf(self, x):
return 1.5 * (1.0 - x*x)
ss.probplot(accept, dist=my_pdf(a=a, b=b, name='my_pdf'), plot=plt)
and you should get something like
Upvotes: 2