Pasindu Tennage
Pasindu Tennage

Reputation: 1568

Python: How to get Cumulative distribution function for continuous data values?

I have a set of data values, and I want to get the CDF (cumulative distribution function) for that data set.

Since this is a continuous variable, we can't use binning approach as mentioned in (How to get cumulative distribution function correctly for my data in python?). So I came up with following approach.

import scipy.stats as st

def trapezoidal_2(ag, a, b, n):
    h = np.float(b - a) / n
    s = 0.0
    s += ag(a)[0]/2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0]/2.0
    return s * h

def get_cdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)

    cdf = [0]
    x = []
    k = 0

    max_data = max(data)

    while (k < max_data):
        x.append(k)
        k = k + 1

    sum_integral = 0

    for i in range(1, len(x)):
        sum_integral = sum_integral + (trapezoidal_2(ag, x[i - 1], x[i], 2))
        cdf.append(sum_integral)

    return x, cdf

This is how I use this method.

b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)    x_cdf, y_cdf = get_cdf(data)

Ideally I should get a value close to 1 at the end of y_cdf list. But I get a value close to 0.57.

What is going wrong here? Is my approach correct?

Thanks.

Upvotes: 0

Views: 4300

Answers (3)

Stop harming Monica
Stop harming Monica

Reputation: 12618

The value of the cdf at x is the integral of the pdf between -inf and x, but you are computing it between 0 and x. Maybe you are assuming that the pdf is 0 for x < 0 but it is not:

rs = np.random.RandomState(seed=52221829)
b = 1
data = st.pareto.rvs(b, size=10000, random_state=rs)
ag = st.gaussian_kde(data)

x = np.linspace(-100, 100)
plt.plot(x, ag.pdf(x))

enter image description here

So this is probably what's going wrong here: you not checking your assumptions.

Your code for computing the integral is painfully slow, there are better ways to do this with scipy but gaussian_kde provides the method integrate_box_1d to integrate the pdf. If you take the integral from -inf everything looks right.

cdf = np.vectorize(lambda x: ag.integrate_box_1d(-np.inf, x))
plt.plot(x, cdf(x))

enter image description here

Integrating between 0 and x you get the same you are seeing now (to the right of 0), but that's not a cdf at all:

wrong_cdf = np.vectorize(lambda x: ag.integrate_box_1d(0, x))
plt.plot(x, wrong_cdf(x))

enter image description here

Upvotes: 3

Sam Mason
Sam Mason

Reputation: 16213

I think it's just:

def get_cdf(data):
  return sorted(data), np.linspace(0, 1, len(data))

but I might be misinterpreting the question!

when I compare this to the analytic result I get the same:

x_cdf, y_cdf = get_cdf(st.pareto.rvs(1, size=10000))

import matplotlib.pyplot as plt
plt.semilogx(x_cdf, y_cdf)
plt.semilogx(x_cdf, st.pareto.cdf(x_cdf, 1))

Upvotes: 0

MattB
MattB

Reputation: 560

Not sure about why your function is not working exactly but one way of calculating CDF is as follows:

def get_cdf_1(data):

    # start with sorted list of data
    x = [i for i in sorted(data)]

    cdf = []

    for xs in x:
        # get the sum of the values less than each data point and store that value
        # this is normalised by the sum of all values
        cum_val = sum([i for i in data if i <= xs])/sum(data) 
        cdf.append(cum_val)

    return x, cdf

There is no doubt a faster way of computing this using numpy arrays rather than appending values to a list, but this returns values in the same format as your original example.

Upvotes: 0

Related Questions