J Kelly
J Kelly

Reputation: 510

numpy histogram cumulative density does not sum to 1

Taking a tip from another thread (@EnricoGiampieri's answer to cumulative distribution plots python), I wrote:

# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')

I put in the density=1 from the documentation on np.histogram, which says:

Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

Well, indeed, when plotted, they don't sum to 1. But, I do not understand the "bins of unity width." When I set the bins to 1, of course, I get an empty chart; when I set them to the population size, I don't get a sum to 1 (more like 0.2). When I use the 40 bins suggested, they sum to about .006.

Can anybody give me some guidance? Thanks!

Upvotes: 19

Views: 28068

Answers (4)

coffeenino
coffeenino

Reputation: 11

I found an easier solution, which makes your bins add up to 1.

Set

density = False

and instead use weights, like so:

weights=np.ones(len(values)) / len(values)

I can't explain it properly, but you can read it here: https://github.com/matplotlib/matplotlib/issues/10398/

Upvotes: 0

bhanu pratap
bhanu pratap

Reputation: 634

Actually the statement

"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "

means that the output that we are getting is the probability density function for the respective bins, now since in pdf, the probability between two value say 'a' and 'b' is represented by the area under the pdf curve between the range 'a' and 'b'. therefore to get the probability value for a respective bin, we have to multiply the pdf value of that bin by its bin width, and then the sequence of probabilities obtained can be directly used for calculating the cumulative probabilities(as they are now normalized).

note that the sum of the new calculated probabilities will give 1, which satisfies the fact that the sum of total probability is 1, or in other words, we can say that our probabilities are normalized.

see code below, here i have use bins of different widths, some are of width 1 and some are of width 2,

import numpy as np
import math
rng = np.random.RandomState(10)   # deterministic random data
a = np.hstack((rng.normal(size=1000),
               rng.normal(loc=5, scale=2, size=1000))) # 'a' is our distribution of data
mini=math.floor(min(a))
maxi=math.ceil(max(a))
print(mini)
print(maxi)
ar1=np.arange(mini,maxi/2)
ar2=np.arange(math.ceil(maxi/2),maxi+2,2)
ar=np.hstack((ar1,ar2))
print(ar)  # ar is the array of unequal widths, which is used below to generate the bin_edges
counts, bin_edges = np.histogram(a, bins=ar, 
                             density = True)
print(counts)    # the pdf values of respective bin_edges
print(bin_edges) # the corresponding bin_edges
print(np.sum(counts*np.diff(bin_edges)))  #finding total sum of probabilites, equal to 1
print(np.cumsum(counts*np.diff(bin_edges))) #to get the cummulative sum, see the last value, it is 1.

Now the reason I think they try to mention by saying that the width of bins should be 1, is might be because of the fact that if the width of bins is equal to 1, then the value of pdf and probabilities for any bin are equal, because if we calculate the area under the bin, then we are basically multiplying the 1 with the corresponding pdf of that bin, which is again equal to that pdf value. so in this case, the value of pdf is equal to the value of the respective bins probabilities and hence already normalized.

Upvotes: 0

Paul H
Paul H

Reputation: 68256

You can simply normalize your values variable yourself like so:

unity_values = values / values.sum()

A full example would look something like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.normal(size=37)
density, bins = np.histogram(x, normed=True, density=True)
unity_density = density / density.sum()

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
widths = bins[:-1] - bins[1:]
ax1.bar(bins[1:], density, width=widths)
ax2.bar(bins[1:], density.cumsum(), width=widths)

ax3.bar(bins[1:], unity_density, width=widths)
ax4.bar(bins[1:], unity_density.cumsum(), width=widths)

ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()

enter image description here

Upvotes: 24

perimosocordiae
perimosocordiae

Reputation: 17847

You need to make sure your bins are all width 1. That is:

np.all(np.diff(base)==1)

To achieve this, you have to manually specify your bins:

bins = np.arange(np.floor(nearest.min()),np.ceil(nearest.max()))
values, base = np.histogram(nearest, bins=bins, density=1)

And you get:

In [18]: np.all(np.diff(base)==1)
Out[18]: True

In [19]: np.sum(values)
Out[19]: 0.99999999999999989

Upvotes: 12

Related Questions