Reputation: 1568
I have a set of data values, and I want to get the CDF (cumulative distribution function) for that data set.
Since this is a continuous variable, we can't use binning approach as mentioned in (How to get cumulative distribution function correctly for my data in python?). So I came up with following approach.
import scipy.stats as st
def trapezoidal_2(ag, a, b, n):
h = np.float(b - a) / n
s = 0.0
s += ag(a)[0]/2.0
for i in range(1, n):
s += ag(a + i*h)[0]
s += ag(b)[0]/2.0
return s * h
def get_cdf(data):
a = np.array(data)
ag = st.gaussian_kde(a)
cdf = [0]
x = []
k = 0
max_data = max(data)
while (k < max_data):
x.append(k)
k = k + 1
sum_integral = 0
for i in range(1, len(x)):
sum_integral = sum_integral + (trapezoidal_2(ag, x[i - 1], x[i], 2))
cdf.append(sum_integral)
return x, cdf
This is how I use this method.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data) x_cdf, y_cdf = get_cdf(data)
Ideally I should get a value close to 1 at the end of y_cdf list. But I get a value close to 0.57.
What is going wrong here? Is my approach correct?
Thanks.
Upvotes: 0
Views: 4300
Reputation: 12618
The value of the cdf at x is the integral of the pdf between -inf and x, but you are computing it between 0 and x. Maybe you are assuming that the pdf is 0 for x < 0 but it is not:
rs = np.random.RandomState(seed=52221829)
b = 1
data = st.pareto.rvs(b, size=10000, random_state=rs)
ag = st.gaussian_kde(data)
x = np.linspace(-100, 100)
plt.plot(x, ag.pdf(x))
So this is probably what's going wrong here: you not checking your assumptions.
Your code for computing the integral is painfully slow, there are better ways to do this with scipy
but gaussian_kde
provides the method integrate_box_1d
to integrate the pdf. If you take the integral from -inf everything looks right.
cdf = np.vectorize(lambda x: ag.integrate_box_1d(-np.inf, x))
plt.plot(x, cdf(x))
Integrating between 0 and x you get the same you are seeing now (to the right of 0), but that's not a cdf at all:
wrong_cdf = np.vectorize(lambda x: ag.integrate_box_1d(0, x))
plt.plot(x, wrong_cdf(x))
Upvotes: 3
Reputation: 16213
I think it's just:
def get_cdf(data):
return sorted(data), np.linspace(0, 1, len(data))
but I might be misinterpreting the question!
when I compare this to the analytic result I get the same:
x_cdf, y_cdf = get_cdf(st.pareto.rvs(1, size=10000))
import matplotlib.pyplot as plt
plt.semilogx(x_cdf, y_cdf)
plt.semilogx(x_cdf, st.pareto.cdf(x_cdf, 1))
Upvotes: 0
Reputation: 560
Not sure about why your function is not working exactly but one way of calculating CDF is as follows:
def get_cdf_1(data):
# start with sorted list of data
x = [i for i in sorted(data)]
cdf = []
for xs in x:
# get the sum of the values less than each data point and store that value
# this is normalised by the sum of all values
cum_val = sum([i for i in data if i <= xs])/sum(data)
cdf.append(cum_val)
return x, cdf
There is no doubt a faster way of computing this using numpy arrays rather than appending values to a list, but this returns values in the same format as your original example.
Upvotes: 0