Pedro
Pedro

Reputation: 330

Histograms Matplotlib vs Numpy

I'm trying to get the 1-sigma (or 2-sigma) value from a 1-D histogram. I don't need the plot but just to make sure it is right, I decided to plot it. But comparing it to matplotlib histogram, it is a bit off. Here is a simple MWE. Is this correct?

import numpy as np
import matplotlib.pyplot as plt

a=np.array([-100,100,9,-9,2,-2,3,-3,5,-5])

#matplotlib histogram
plt.figure(),plt.hist(a)

#numpy histogram
ha,ba = np.histogram(a)
plt.figure()
plt.plot(ba[:-1],ha)
sigma = 1
sigmaleft  = ha.mean() - sigma * ha.std()
sigmaright = ha.mean() + sigma * ha.std()
print([sigmaleft, sigmaright])

Matplotlib histogram Numpy histogram

Upvotes: 1

Views: 198

Answers (1)

Mad Physicist
Mad Physicist

Reputation: 114320

The variable ba returned in ha,ba = np.histogram(a) is the bin edges, which is why there is one more element than in the histogram data ha. To plot "correctly", set your x-axis to the bin centers:

bc = 0.5 * (ba[:-1] + ba[1:])
plt.bar(bc, ha)

Just for fun, you can also write

bc = np.lib.stride_tricks.as_strided(ba, strides=ba.strides * 2, shape=(2, ba.size - 1)).mean(0)

All that being said, ha.std is not a good approximation of the standard deviation of the data, any more than ha.mean is of the mean. The bin counts are just weights, while the data is encoded in the x-axis. The mean can be approximated with something like

approxMean = (bc * ha).sum() / ha.sum()

Similarly you can do the following for the standard deviation:

approxStd = np.sqrt(((bc - approxMean)**2 * ha).sum() / ha.sum())

You can also use the alternative formula:

approxStd = np.sqrt((bc**2 * ha).sum() / ha.sum() - ((bc * ha) / ha.sum()).sum()**2)

In all cases, only do this if you don't have access to the real data. Computing the mean and standard deviation will be much more accurate than from the histogram.

Upvotes: 1

Related Questions