Cumulative Distribution Function from arbitrary Probability Distribution Function

Question

I'm trying to plot a probability distribution function for a set of data read from a CSV file:

import numpy as np
import math
import matplotlib.pyplot as plt

data=np.loadtxt('data.csv',delimiter=',',skiprows=1)
x_value1= data[:,1]
x_value2= data[:,2]
weight1= data[:,3]
weight2= data[:,4]

where weight1 is an array holding the weight of each entry in x_value1, and weight2 does the same for x_value2. I produce histograms and pass the weights through the weights parameter:

plt.hist(x_value1, bins=40, color='r', density=True, weights=weight1, alpha=0.8, label='x_value1')
plt.hist(x_value2, bins=40, color='b', density=True, weights=weight2, alpha=0.6, label='x_value2')


My problem now is converting this PDF to a CDF. I read in one of the posts here that numpy.cumsum() can turn a set of data into a CDF, so I tried it together with np.histogram():

values1,base1= np.histogram(x_value1, bins=40)
values2,base2= np.histogram(x_value2, bins=40)

cumulative1=np.cumsum(values1)
cumulative2=np.cumsum(values2)

plt.plot(base1[:-1],cumulative1,c='red',label='x_value1')
plt.plot(base2[:-1],cumulative2,c='blue',label='x_value2')

plt.title("CDF for x_value1 and x_value2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


I don't know whether this plot is right, because I didn't include the weights (weight1 and weight2) when computing the CDF. How can I include the weights when plotting the CDF?

DrV · Accepted Answer

If I understand your data correctly, you have a number of samples, each with an associated weight. Maybe what you want is the empirical CDF of the weighted sample.

Say the samples are in vector x and the weights in vector w (for example x_value1 and weight1 from the question). Let us first stack them into an N×2 array:

arr = np.column_stack((x,w))

Then we will sort this array by the samples:

arr = arr[arr[:,0].argsort()]

This sorting may look a bit odd, but argsort returns the indices that would sort the first column (the position of the smallest sample first, then the position of the second smallest, etc.). When the two-column array is indexed by this result, whole rows are rearranged so that the first column is ascending and each weight stays with its sample. (Using only sort with axis=0 does not work, as it sorts both columns independently.)
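
As a small illustration of the difference between indexing with argsort and sorting along axis 0, here is a sketch with three made-up samples. The values in x and w are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

# Hypothetical sample values and weights, for illustration only
x = np.array([3.0, 1.0, 2.0])
w = np.array([0.1, 0.7, 0.2])
arr = np.column_stack((x, w))

# Indices that would sort the first column: sample 1.0 sits at index 1, etc.
order = arr[:, 0].argsort()

# Indexing with those indices moves whole rows, so each weight stays
# attached to its sample.
paired = arr[order]

# Sorting along axis 0 sorts every column on its own, which breaks the
# pairing between samples and weights.
independent = np.sort(arr, axis=0)

print(order)
print(paired)
print(independent)
```

In the paired result the sample 1.0 keeps its weight 0.7, while in the independently sorted array it ends up next to 0.1.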

Now we can create the cumulative fraction by taking the cumulative sum of weights:

cum = np.cumsum(arr[:,1])

This must be normalized by the total weight so that the curve ends at 1:

cum /= cum[-1]

Now we can plot the cumulative distribution:

plt.plot(arr[:,0], cum)

Now the x axis is the sample value and the y axis is the weighted fraction of samples at or below that value.
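
The steps above can be collected into one helper. This is only a sketch: the function names weighted_ecdf and binned_weighted_cdf are made up here, and the sample data is hypothetical. The second function is a binned variant that is not part of the steps above; it assumes you want to keep the np.histogram plus np.cumsum approach from the question and simply pass the weights to np.histogram.

```python
import numpy as np

def weighted_ecdf(x, w):
    """Sorted sample values and the cumulative weight fraction at each one."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)          # indices that sort the samples
    cum = np.cumsum(w[order])      # running total of the weights
    return x[order], cum / cum[-1] # normalize so the last value is 1

def binned_weighted_cdf(x, w, bins=40):
    """Binned variant: weighted histogram followed by a cumulative sum."""
    totals, edges = np.histogram(x, bins=bins, weights=w)
    cum = np.cumsum(totals)
    # The cumulative value of a bin is reached at its right edge.
    return edges[1:], cum / cum[-1]

# Hypothetical data, for illustration only
x = [3.0, 1.0, 2.0]
w = [1.0, 2.0, 1.0]

xs, cdf = weighted_ecdf(x, w)
right_edges, binned_cdf = binned_weighted_cdf(x, w, bins=2)

print(xs, cdf)
print(right_edges, binned_cdf)
```

Either result can be drawn with plt.plot(xs, cdf), or with plt.step(xs, cdf, where='post') if you prefer the staircase shape of an empirical CDF. The unbinned version keeps every sample, while the binned version depends on the choice of bins.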
