Reputation: 1665
I have an array of values and have created a histogram of the data using numpy.histogram, as follows:
histo = numpy.histogram(arr, nbins)
where nbins is the number of bins derived from the range of the data (max-min) divided by a desired bin width.
From the output I create a cumulative distribution function using:
cdf = numpy.cumsum(histo[0])
normCdf = cdf/numpy.amax(cdf)
However, I need an array of normCdf values that corresponds with the values in the original array (arr). For example, if a value in the original array arr is near the minimum value of arr, then its corresponding normCdf value will be high (e.g. 0.95). (In this example, as I am working with radar data, my data is in decibels and is negative. Therefore the lowest value is where the CDF reaches its maximum.)
I'm struggling, conceptually, how I achieve an array whereby each value in the array has its corresponding value under the CDF (normCdf value). Any help would be appreciated. The histogram with the CDF is below.
Upvotes: 1
Views: 4815
Reputation: 14255
This is old, but may still be of help to someone.
Consider the OP's last sentence:
Im struggling, conceptually, how I achieve an array whereby each value in the array has its corresponding value under the CDF (normCdf value).
If I understand correctly, what the OP is asking for boils down to the (normalized) ordinal rank of the array elements.
The ordinal rank of an array element i indicates how many elements in the array have a value smaller than that of element i (assuming distinct values). Up to normalization, this is equivalent to the discrete (empirical) cumulative distribution.
Ordinal ranking is related to sorting by the following equality (where u is an unsorted list):
u == [sorted(u)[i] for i in ordinal_rank(u)]
Based on the implementation of scipy.stats.rankdata, the ordinal rank can be computed as follows:
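import numpy

def ordinal_rank(data):
    # rank[i] is the 0-based position of data[i] in the sorted data
    data = numpy.asarray(data)
    rank = numpy.empty(data.size, dtype=int)  # integer ranks, usable as indices
    rank[numpy.argsort(data)] = numpy.arange(data.size)
    return rank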
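As a quick sanity check of the ranking and of the sorting equality above, here is a minimal sketch on a small, made-up list of distinct negative values (the sample numbers are hypothetical, not the OP's data). It assumes the ranks are stored with an integer dtype so that they can be used as list indices.

```python
import numpy

def ordinal_rank(data):
    # Integer ranks (assumption: dtype=int) so they can index a sorted list.
    data = numpy.asarray(data)
    rank = numpy.empty(data.size, dtype=int)
    rank[numpy.argsort(data)] = numpy.arange(data.size)
    return rank

# Hypothetical sample values (negative, like decibel data); all distinct.
u = [-12.5, -30.0, -7.25, -21.0, -18.5]
ranks = ordinal_rank(u)

# For distinct values, each rank counts the elements smaller than that element.
print(ranks)

# The sorting equality: indexing the sorted list by rank recovers u.
recovered = [sorted(u)[i] for i in ranks]
print(recovered == u)
```

For distinct values, numpy.argsort(numpy.argsort(u)) should give the same ranks, at the cost of a second sort.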
So, to answer the OP's question:
The normalized (empirical) cumulative distribution corresponding to the values in the OP's arr can then be computed as follows:
normalized_cdf = ordinal_rank(arr) / len(arr)
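If arr contains repeated values, the rank-based approach assigns different CDF values to equal elements. One possible alternative, sketched here with numpy.searchsorted on a made-up array (the values are hypothetical), gives tied elements the same value. The second quantity is an assumption about what the question's convention means: the fraction of elements greater than or equal to each value, which is largest at the minimum of arr.

```python
import numpy

# Hypothetical decibel-like values containing a repeated value (-20.0).
arr = numpy.array([-20.0, -35.0, -20.0, -10.0, -28.0])

sorted_arr = numpy.sort(arr)

# side='right' counts the elements <= each value, so ties share one CDF value.
cdf_at_values = numpy.searchsorted(sorted_arr, arr, side='right') / arr.size

# side='left' counts the elements < each value; subtracting from the size
# gives the fraction of elements >= each value (highest at the minimum).
n_smaller = numpy.searchsorted(sorted_arr, arr, side='left')
fraction_at_or_above = (arr.size - n_smaller) / arr.size

print(cdf_at_values)
print(fraction_at_or_above)
```

Both results have the same shape and order as arr, so they can be used directly alongside the original values.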
And the result can be displayed using:
pyplot.plot(arr, normalized_cdf, marker='.', linestyle='')
Note that if you only need the plot, there is an easier way:
n = len(arr)
pyplot.plot(numpy.sort(arr), numpy.arange(n) / n)
And, finally, we can verify this by plotting the cumulative normalized histogram as follows (using an arbitrary number of bins):
pyplot.hist(arr, bins=100, cumulative=True, density=True)
Here's an example comparing the three approaches, using 30 bins for the cumulative histogram:
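The same comparison can also be made numerically, without plotting. The following is only a sketch on synthetic data (a normal distribution standing in for the radar values, which is an assumption): it looks up the histogram-based CDF for every element of arr with numpy.digitize, as the question originally intended, and compares it with the rank-based result.

```python
import numpy

rng = numpy.random.default_rng(0)
# Synthetic stand-in for the radar data (hypothetical values in dB).
arr = rng.normal(loc=-20.0, scale=5.0, size=10_000)

# Rank-based empirical CDF, one value per element of arr.
rank = numpy.empty(arr.size, dtype=int)
rank[numpy.argsort(arr)] = numpy.arange(arr.size)
rank_cdf = rank / arr.size

# Histogram-based CDF, as in the question, using 30 bins.
counts, edges = numpy.histogram(arr, bins=30)
hist_cdf = numpy.cumsum(counts) / counts.sum()

# Find the bin of each element; clip so the maximum falls in the last bin.
bin_index = numpy.clip(numpy.digitize(arr, edges) - 1, 0, counts.size - 1)
hist_cdf_at_values = hist_cdf[bin_index]

# The two estimates differ by at most the largest fraction of data in one bin.
max_gap = numpy.max(numpy.abs(hist_cdf_at_values - rank_cdf))
print(max_gap, counts.max() / arr.size)
```

The histogram lookup is coarser because every element in a bin receives the same value, so the gap shrinks as the number of bins grows.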
Upvotes: 0