Reputation: 87
I'm trying to plot a probability density function (PDF) for a given set of data from a CSV file:
import numpy as np
import math
import matplotlib.pyplot as plt
data=np.loadtxt('data.csv',delimiter=',',skiprows=1)
x_value1= data[:,1]
x_value2= data[:,2]
weight1= data[:,3]
weight2= data[:,4]
where weight1 is an array of weights for the data in x_value1, and weight2 is the same for x_value2. I produce a histogram and pass the weights through the weights parameter:
plt.hist(x_value1, bins=40, color='r', density=True, weights=weight1, alpha=0.8, label='x_value1')
plt.hist(x_value2, bins=40, color='b', density=True, weights=weight2, alpha=0.6, label='x_value2')
My problem now is converting this PDF to CDF. I read from one of the posts here that you can use numpy.cumsum() to convert a set of data to CDF, so I tried it together with np.histogram()
values1,base1= np.histogram(x_value1, bins=40)
values2,base2= np.histogram(x_value2, bins=40)
cumulative1=np.cumsum(values1)
cumulative2=np.cumsum(values2)
plt.plot(base1[:-1],cumulative1,c='red',label='x_value1')
plt.plot(base2[:-1],cumulative2,c='blue',label='x_value2')
plt.title("CDF for x_value1 and x_value2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I don't know if this plot is right because I didn't include the weights (weight1 and weight2) while doing the CDF. How can I include the weights while plotting the CDF?
Upvotes: 0
Views: 2164
Reputation: 23550
If I understand your data correctly, you have a number of samples, each with an associated weight. What you probably want is the weighted empirical CDF of the sample.
The samples are in vector x and the weights in vector w. Let us first construct an Nx2 array of them:
arr = np.column_stack((x,w))
Then we will sort this array by the samples:
arr = arr[arr[:,0].argsort()]
This sorting may look a bit odd, but argsort returns the indices that would sort the first column (the index of the smallest sample first, then the second smallest, etc.). When the two-column array is indexed by this result, the rows are rearranged so that the first column is ascending and each weight stays with its sample. (Using only sort with axis=0 does not work, as it sorts both columns independently.)
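To see why the row-wise reordering matters, here is a small sketch on toy data. The three samples and weights are made up for illustration and are not from the question's CSV file.

```python
import numpy as np

# Hypothetical toy data: three samples and their weights
x = np.array([3.0, 1.0, 2.0])
w = np.array([0.1, 0.5, 0.4])
arr = np.column_stack((x, w))

# argsort returns the row indices that put column 0 in ascending order
order = arr[:, 0].argsort()

# Indexing with those indices moves whole rows, so each weight follows its sample
by_rows = arr[order]

# np.sort with axis=0 sorts each column on its own and breaks the pairing
by_columns = np.sort(arr, axis=0)

print(by_rows)
print(by_columns)
```

In by_rows the sample 1.0 keeps its weight 0.5, while in by_columns it ends up next to 0.1, which belongs to a different sample.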
Now we can create the cumulative fraction by taking the cumulative sum of weights:
cum = np.cumsum(arr[:,1])
This must be normalized so that the full scale is 1.
cum /= cum[-1]
Now we can plot the cumulative distribution:
plt.plot(arr[:,0], cum)
Now the X axis is the sample value and the Y axis is the fraction of the total weight carried by samples at or below that value.
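The steps above can be collected into one helper. This is only a sketch: the function name weighted_ecdf and the synthetic data are assumptions made for illustration, standing in for one sample/weight column pair from the CSV file. The last lines show a binned variant, since np.histogram also accepts a weights argument.

```python
import numpy as np

def weighted_ecdf(x, w):
    """Return the sorted samples and the cumulative weight fraction at each one."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)        # indices that sort the samples
    cum = np.cumsum(w[order])    # running total of weights in sample order
    return x[order], cum / cum[-1]

# Hypothetical synthetic data (assumption: weights are positive)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
w = rng.uniform(0.5, 1.5, size=1000)

xs, cdf = weighted_ecdf(x, w)

# Binned variant: weighted bin totals, then a normalized cumulative sum
values, edges = np.histogram(x, bins=40, weights=w)
cum_binned = np.cumsum(values)
binned_cdf = cum_binned / cum_binned[-1]
```

With matplotlib, plt.plot(xs, cdf) would draw the unbinned curve. For the binned curve, plotting against the right bin edges, plt.plot(edges[1:], binned_cdf), is arguably the better match, because each cumulative value covers everything up to the end of its bin.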
Upvotes: 2