Reputation: 330
I'm trying to get the 1-sigma (or 2-sigma) value from a 1-D histogram. I don't need the plot but just to make sure it is right, I decided to plot it. But comparing it to matplotlib histogram, it is a bit off. Here is a simple MWE. Is this correct?
import numpy as np
import matplotlib.pyplot as plt
a=np.array([-100,100,9,-9,2,-2,3,-3,5,-5])
#matplotlib histogram
plt.figure(),plt.hist(a)
#numpy histogram
ha,ba = np.histogram(a)
plt.figure()
plt.plot(ba[:-1],ha)
sigma = 1
sigmaleft = ha.mean() - sigma * ha.std()
sigmaright = ha.mean() + sigma * ha.std()
print([sigmaleft, sigmaright])
Upvotes: 1
Views: 198
Reputation: 114320
The variable ba
returned in ha,ba = np.histogram(a)
is the bin edges, which is why there is one more element than in the histogram data ha
. To plot "correctly", set your x-axis to the bin centers:
bc = 0.5 * (ba[:-1] + ba[1:])
plt.bar(bc, ha)
Just for fun, you can also write
bc = np.lib.stride_tricks.as_strided(ba, strides=ba.strides * 2, shape=(2, ba.size - 1)).mean(0)
All that being said, ha.std
is not a good approximation of the standard deviation of the data, any more than ha.mean
is of the mean. The bin counts are just weights, while the data is encoded in the x-axis. The mean can be approximated with something like
approxMean = (bc * ha).sum() / ha.sum()
Similarly you can do the following for the standard deviation:
approxStd = np.sqrt(((bc - approxMean)**2 * ha).sum() / ha.sum())
You can also use the alternative formula:
approxStd = np.sqrt((bc**2 * ha).sum() / ha.sum() - ((bc * ha) / ha.sum()).sum()**2)
In all cases, only do this if you don't have access to the real data. Computing the mean and standard deviation will be much more accurate than from the histogram.
Upvotes: 1