GeoMonkey

Reputation: 1665

Numpy cumulative distribution function (CDF)

I have an array of values and have created a histogram of the data using numpy.histogram, as follows:

histo = numpy.histogram(arr, nbins)

where nbins is the number of bins derived from the range of the data (max-min) divided by a desired bin width.

From the output I create a cumulative distribution function using:

cdf = numpy.cumsum(histo[0])
normCdf = cdf / numpy.amax(cdf)

However, I need an array of normCdf values that corresponds with the values in the original array (arr). For example, if a value in the original array arr is near the minimum value of arr, then its corresponding normCdf value will be high (e.g. 0.95). (In this example I am working with radar data, so my data is in decibels and is negative. Therefore the lowest value is where the CDF reaches its maximum.)

I'm struggling, conceptually, with how to achieve an array whereby each value in the array has its corresponding value under the CDF (normCdf value). Any help would be appreciated. The histogram with the CDF is below.

[Figure: histogram of the data with the CDF]

Upvotes: 1

Views: 4815

Answers (1)

djvg

Reputation: 14255

This is old, but may still be of help to someone.

Consider the OP's last sentence:

I'm struggling, conceptually, with how to achieve an array whereby each value in the array has its corresponding value under the CDF (normCdf value).

If I understand correctly, what the OP is asking for actually boils down to the (normalized) ordinal rank of the array elements.

The ordinal rank of an array element i basically indicates how many elements in the array have a value smaller than that of element i (tied values still get distinct ranks). Divided by the number of elements, this is equivalent to the discrete (empirical) cumulative distribution.

Ordinal ranking is related to sorting by the following equality (where u is an unsorted list):

u == [sorted(u)[i] for i in ordinal_rank(u)]

Based on the implementation of scipy.stats.rankdata, the ordinal rank can be computed as follows:

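import numpy

def ordinal_rank(data):
    data = numpy.asarray(data)
    # integer dtype, so the ranks can also be used as indices
    rank = numpy.empty(data.size, dtype=int)
    rank[numpy.argsort(data)] = numpy.arange(data.size)
    return rank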
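The equality above can be checked on a small example. This is only a sketch: the sample values are made up, and the rank array is given an integer dtype here so that the ranks can be used as indices.

```python
import numpy

def ordinal_rank(data):
    data = numpy.asarray(data)
    # Integer dtype so the ranks can be used directly as indices.
    rank = numpy.empty(data.size, dtype=int)
    rank[numpy.argsort(data)] = numpy.arange(data.size)
    return rank

u = numpy.array([-12.5, -30.0, -7.25, -18.0])  # made-up values, no ties
ranks = ordinal_rank(u)

# Indexing the sorted values with the ranks gives back the original order.
reconstructed = numpy.sort(u)[ranks]
print(ranks)                               # [2 0 3 1]
print(numpy.array_equal(reconstructed, u)) # True
```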

So, to answer the OP's question:

The normalized (empirical) cumulative density corresponding to the values in the OP's arr can then be computed as follows:

normalized_cdf = ordinal_rank(arr) / len(arr)
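One thing to keep in mind: with ordinal ranks the smallest element maps to 0, and equal values get different CDF values. If tied values should share one value and the largest element should map to 1, a possible variant is to count the elements less than or equal to each value using numpy.searchsorted. This is a sketch under that assumption, not part of the approach above; the helper name ecdf_at_values and the sample data are made up.

```python
import numpy

def ecdf_at_values(data):
    """Fraction of elements <= each element (hypothetical helper name)."""
    data = numpy.asarray(data)
    sorted_data = numpy.sort(data)
    # side='right' returns, for each value, the number of elements <= it.
    counts = numpy.searchsorted(sorted_data, data, side='right')
    return counts / data.size

arr = numpy.array([-20.0, -5.0, -20.0, -10.0])  # made-up values with a tie
ecdf = ecdf_at_values(arr)
print(ecdf)  # [0.5  1.   0.5  0.75]
```

The result is in the same order as arr, so it can be plotted against arr in the same way as normalized_cdf.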

And the result can be displayed using:

from matplotlib import pyplot
pyplot.plot(arr, normalized_cdf, marker='.', linestyle='')

Note that, if you only need the plot, there is an easier way:

n = len(arr)
pyplot.plot(numpy.sort(arr), numpy.arange(n) / n)

And, finally, we can verify this by plotting the cumulative normalized histogram as follows (using an arbitrary number of bins):

pyplot.hist(arr, bins=100, cumulative=True, density=True)
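To stay closer to the histogram-based approach from the question, the per-bin cumulative values can also be mapped back onto the elements of arr. A minimal sketch, assuming numpy.digitize is used for the bin lookup; the sample data, the bin count and the name cdf_per_value are made up for illustration:

```python
import numpy

arr = numpy.array([-30.0, -22.0, -21.0, -15.0, -14.0, -13.0, -5.0])  # made-up
nbins = 5

counts, edges = numpy.histogram(arr, nbins)
norm_cdf = numpy.cumsum(counts) / counts.sum()

# Digitizing against the interior edges yields bin indices 0..nbins-1.
# The maximum value lands in the last bin, matching numpy.histogram,
# whose last bin is closed on the right.
bin_index = numpy.digitize(arr, edges[1:-1])

# One normalized CDF value per element of arr, in the original order.
cdf_per_value = norm_cdf[bin_index]
```

Unlike the rank-based result, all values in the same bin get the same CDF value here, so the resolution depends on the bin width. Values lying exactly on a bin edge may deserve a closer look if floating-point rounding matters for your data.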

Here's an example comparing the three approaches, using 30 bins for the cumulative histogram:

[Figure: empirical cumulative density, comparing the three approaches]

Upvotes: 0
