Reputation: 5366
from scipy import stats
import numpy as np
class your_distribution(stats.rv_continuous):
def _pdf(self, x):
p0 = 10.9949
p1 = 0.394447
p2 = 12818.4
p3 = 2.38898
return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))
distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=50000)
The code above generates 50000 samples from a normalized pdf in the range 0.15 to 10.1. However, there is a disproportionately large number of samples generated at the upper bound b=10.1
. This does not make sense, as seen when the pdf is plotted.
How would I fix this issue?
Upvotes: 2
Views: 1142
Reputation: 23637
The PDF is correctly normalized for the full range of the distribution. However, setting a
and b
simply cuts the PDF without any re-normalization. With (a=0.15, b=10.1)
the PDF no longer integrates to 1, and by a quirk of the scipy implementation the remaining density is apparently added at the end of the range. This causes the large number of samples at the upper bound.
We can visualize what is going on by plotting the cumulative density function (CDF) for a=0 and a=0.15:
x = np.linspace(0, 15, 1000)
distribution = your_distribution(a=0.0, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0')
distribution = your_distribution(a=0.15, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0.15')
plt.legend()
To get rid of the jump in the CDF and the spurious samples at the upper range we need to re-normalize the PDF for the a..b range. I am too lazy to analytically work out the correct factor, so let's make scipy do the hard work:
from scipy import stats
from scipy.integrate import quad
import numpy as np
# I pulled the definition of the PDF out of the class so we can use it to
# compute the scale factor.
def pdf(x):
p0 = 10.9949
p1 = 0.394447
p2 = 12818.4
p3 = 2.38898
return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))
class your_distribution(stats.rv_continuous):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# integrate area of the PDF in range a..b
self.scale, _ = quad(pdf, self.a, self.b)
def _pdf(self, x):
return pdf(x) / self.scale # scale PDF so that it integrates to 1 in range a..b
distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=1000)
If you happen to know the analytical solution of the integral you can use it instead of the call to quad
.
Upvotes: 7