function

Reputation: 1330

How does matplotlib calculate the density for a histogram?

Reading through the matplotlib plt.hist documentation, I found a density parameter that can be set to True. The documentation says:

density : bool, optional
            If ``True``, the first element of the return tuple will
            be the counts normalized to form a probability density, i.e.,
            the area (or integral) under the histogram will sum to 1.
            This is achieved by dividing the count by the number of
            observations times the bin width and not dividing by the total
            number of observations. If *stacked* is also ``True``, the sum of
            the histograms is normalized to 1.

The line I am unsure about is: "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations".

I tried replicating this with the sample data.

**Using matplotlib's built-in calculation**:

ser = pd.Series(np.random.normal(size=1000))
ser.hist(density = 1,  bins=100)

**Manual calculation of the density**:

arr_hist , edges = np.histogram( ser, bins =100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1] , samp )
plt.grid()

The two plots have completely different y-axis scales. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?

Upvotes: 2

Views: 5458

Answers (1)

ImportanceOfBeingErnest

Reputation: 339200

That is an ambiguity in the language. The sentence

This is achieved by dividing the count by the number of observations times the bin width

needs to be read as

This is achieved by dividing (the count) by (the number of observations times the bin width)

i.e.

count / (number of observations * bin width)
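As a quick numerical check of that reading, here is a minimal sketch that compares the manual formula with NumPy's own density normalization. The seed and the variable names are my own choices for illustration, and I am assuming that plt.hist produces the same bar heights as np.histogram(..., density=True) for a single, unstacked dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
data = rng.normal(size=1000)

counts, edges = np.histogram(data, bins=100)
widths = np.diff(edges)  # one width per bin

# count / (number of observations * bin width)
manual_density = counts / (data.size * widths)

# NumPy's own normalization, for comparison
numpy_density, _ = np.histogram(data, bins=100, density=True)

# Area under the histogram: sum of height * width over all bins
area = np.sum(manual_density * widths)

print(np.allclose(manual_density, numpy_density))
print(np.isclose(area, 1.0))
```

Both checks should print True: the heights agree bin by bin, and the heights times the widths add up to 1, which is what "probability density" means here.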

Complete code:

import numpy as np
import matplotlib.pyplot as plt

arr = np.random.normal(size=1000)

fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density = True,  bins=100)
ax1.grid()


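# Manual density: count / (number of observations * bin width)
arr_hist, edges = np.histogram(arr, bins=100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
# align="edge" puts each bar's left side on the bin's left edge, as hist does
ax2.bar(edges[:-1], samp, width=np.diff(edges), align="edge")
ax2.grid()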

plt.show()
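To see why the version in the question ends up on a different scale: without the parentheses, Python evaluates `/` and `*` left to right, so the expression becomes (count / observations) * width. Here is a small sketch with made-up counts and edges (hypothetical values, not taken from the question's data) showing that the two results differ by a factor of the bin width squared.

```python
import numpy as np

counts = np.array([2, 5, 3])              # hypothetical bin counts
edges = np.array([0.0, 0.5, 1.0, 1.5])    # hypothetical bin edges, width 0.5
n = counts.sum()                          # number of observations
widths = np.diff(edges)

# Expression from the question: parsed as (counts / n) * widths
without_parens = counts / n * widths

# Intended formula: counts / (n * widths)
with_parens = counts / (n * widths)

# The two differ by exactly widths ** 2
ratio = without_parens / with_parens

print(without_parens)
print(with_parens)
print(ratio)
```

With many narrow bins the width is well below 1, so the unparenthesized version comes out much smaller than the true density, which would explain the mismatch in the y-axis scales.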


Upvotes: 3
