Histograms Matplotlib vs Numpy

Question

I'm trying to get the 1-sigma (or 2-sigma) value from a 1-D histogram. I don't need the plot but just to make sure it is right, I decided to plot it. But comparing it to matplotlib histogram, it is a bit off. Here is a simple MWE. Is this correct?

import numpy as np
import matplotlib.pyplot as plt

a=np.array([-100,100,9,-9,2,-2,3,-3,5,-5])

#matplotlib histogram
plt.figure(),plt.hist(a)

#numpy histogram
ha,ba = np.histogram(a)
plt.figure()
plt.plot(ba[:-1],ha)
sigma = 1
sigmaleft  = ha.mean() - sigma * ha.std()
sigmaright = ha.mean() + sigma * ha.std()
print([sigmaleft, sigmaright])

Mad Physicist · Accepted Answer

The variable ba returned in ha,ba = np.histogram(a) is the bin edges, which is why there is one more element than in the histogram data ha. To plot "correctly", set your x-axis to the bin centers:

bc = 0.5 * (ba[:-1] + ba[1:])
plt.bar(bc, ha)

Just for fun, you can also write

bc = np.lib.stride_tricks.as_strided(ba, strides=ba.strides * 2, shape=(2, ba.size - 1)).mean(0)

All that being said, ha.std is not a good approximation of the standard deviation of the data, any more than ha.mean is of the mean. The bin counts are just weights, while the data is encoded in the x-axis. The mean can be approximated with something like

approxMean = (bc * ha).sum() / ha.sum()

Similarly you can do the following for the standard deviation:

approxStd = np.sqrt(((bc - approxMean)**2 * ha).sum() / ha.sum())

You can also use the alternative formula:

approxStd = np.sqrt((bc**2 * ha).sum() / ha.sum() - ((bc * ha) / ha.sum()).sum()**2)

In all cases, only do this if you don't have access to the real data. Computing the mean and standard deviation will be much more accurate than from the histogram.

Histograms Matplotlib vs Numpy

Answers (1)

Related Questions