Reputation: 73
Let's say I have a data set and used matplotlib to draw a histogram of said data set.
n, bins, patches = plt.hist(data, normed=1)
How do I calculate the standard deviation, using the n
and bins
values that hist()
returns? I'm currently doing this to calculate the mean:
s = 0
for i in range(len(n)):
s += n[i] * ((bins[i] + bins[i+1]) / 2)
mean = s / numpy.sum(n)
which seems to work fine as I get pretty accurate results. However, if I try to calculate the standard deviation like this:
t = 0
for i in range(len(n)):
t += (bins[i] - mean)**2
std = np.sqrt(t / numpy.sum(n))
my results are way off from what numpy.std(data)
returns. Replacing the left bin limits with the central point of each bin doesn't change this either. I have the feeling that the problem is that the n
and bins
values don't actually contain any information on how the individual data points are distributed within each bin, but the assignment I'm working on clearly demands that I use them to calculate the standard deviation.
Upvotes: 7
Views: 38509
Reputation: 2177
The following answer is equivalent to Warren Weckesser's, but maybe more familiar to those who prefer to want mean as the expected value:
counts, bins = np.histogram(x)
mids = 0.5*(bins[1:] + bins[:-1])
probs = counts / np.sum(counts)
mean = np.sum(probs * mids)
sd = np.sqrt(np.sum(probs * (mids - mean)**2))
Do take note in certain context you may want the unbiased sample variance where the weights are not normalized by N but N-1.
Upvotes: 6
Reputation: 114781
You haven't weighted the contribution of each bin with n[i]
. Change the increment of t
to
t += n[i]*(bins[i] - mean)**2
By the way, you can simplify (and speed up) your calculation by using numpy.average
with the weights
argument.
Here's an example. First, generate some data to work with. We'll compute the sample mean, variance and standard deviation of the input before computing the histogram.
In [54]: x = np.random.normal(loc=10, scale=2, size=1000)
In [55]: x.mean()
Out[55]: 9.9760798903061847
In [56]: x.var()
Out[56]: 3.7673459904902025
In [57]: x.std()
Out[57]: 1.9409652213499866
I'll use numpy.histogram
to compute the histogram:
In [58]: n, bins = np.histogram(x)
mids
is the midpoints of the bins; it has the same length as n
:
In [59]: mids = 0.5*(bins[1:] + bins[:-1])
The estimate of the mean is the weighted average of mids
:
In [60]: mean = np.average(mids, weights=n)
In [61]: mean
Out[61]: 9.9763028267760312
In this case, it is pretty close to the mean of the original data.
The estimated variance is the weighted average of the squared difference from the mean:
In [62]: var = np.average((mids - mean)**2, weights=n)
In [63]: var
Out[63]: 3.8715035807387328
In [64]: np.sqrt(var)
Out[64]: 1.9676136767004677
That estimate is within 2% of the actual sample standard deviation.
Upvotes: 19