Reputation: 1330
Reading through the matplotlib plt.hist documentation, I see there is a density parameter that can be set to True. The documentation says:
density : bool, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.
The line I am unsure about is "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations".
I tried replicating this with sample data.
**Using matplotlib's built-in calculation**:
ser = pd.Series(np.random.normal(size=1000))
ser.hist(density = 1, bins=100)
**Manual calculation of the density**:
arr_hist , edges = np.histogram( ser, bins =100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1] , samp )
plt.grid()
The y-axis scales of the two plots are completely different. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?
Upvotes: 2
Views: 5458
Reputation: 339200
That is an ambiguity in the language. The sentence
This is achieved by dividing the count by the number of observations times the bin width
needs to be read as
This is achieved by dividing (the count) by (the number of observations times the bin width)
i.e.
count / (number of observations * bin width)
Complete code:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(size=1000)
fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density = True, bins=100)
ax1.grid()
arr_hist , edges = np.histogram(arr, bins =100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
# align='edge' puts each bar's left side on the bin's left edge, matching ax1.hist
ax2.bar(edges[:-1], samp, width=np.diff(edges), align='edge')
ax2.grid()
plt.show()
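To check the reading numerically rather than by eye, here is a minimal sketch that compares the manual formula against np.histogram(..., density=True), which, as far as I know, applies the same normalization that plt.hist uses. The fixed seed and the names manual, reference, misread and area are my own choices for illustration, not anything from the question.

```python
import numpy as np

# Fixed seed so the check is reproducible (an assumption, not in the question)
rng = np.random.default_rng(0)
arr = rng.normal(size=1000)

counts, edges = np.histogram(arr, bins=100)
widths = np.diff(edges)

# Intended reading: count / (number of observations * bin width)
manual = counts / (arr.shape[0] * widths)

# NumPy's own density normalization, used here as the reference
reference, _ = np.histogram(arr, bins=100, density=True)

# Reading used in the question: (count / number of observations) * bin width
misread = counts / arr.shape[0] * widths

# Area under the histogram: sum of bar height times bar width
area = np.sum(manual * widths)

print("manual matches reference:", np.allclose(manual, reference))
print("misread matches reference:", np.allclose(misread, reference))
print("area under manual density:", area)
```

The area comes out as 1 (up to floating-point error) only for the parenthesized version, which is what "normalized to form a probability density" means here.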
Upvotes: 3