Reputation: 1990
I have a list of sets and some basic statistics for each one (number of items, min, max, mean, stddev). I would like to calculate the same statistics for all of the sets combined. Calculating the total count, min, max, and mean is easy, but I'm unsure how to calculate the total standard deviation.
The data looks like this:
Count       Max   Min   Mean   Stddev
1,027,671   781   68    57.8   32.79
839,473     552   54    61.3   48.53
3,012,102   890   41    64.9   41.92
Generating the statistics for all of the sets together:
4,879,246   890   41    62.8   ???
Upvotes: 2
Views: 848
Reputation: 9290
I assume you are writing the code that maintains the distribution, and not just consuming some data that already has the standard deviation computed. The standard deviation isn't a very natural quantity for a computer to maintain. Instead, you should maintain the number of items, the sum, and the sum of the squared items; from those three pieces of raw information you can easily compute the mean and standard deviation of the distribution. I use this strategy in the code linked below. The add operation supports merging two distributions. Notice how simple its implementation is: http://github.com/rrenaud/dominionstats/blob/master/stats.py#L17
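To make the idea concrete, here is a minimal sketch of the count / sum / sum-of-squares approach. It is not the code from the link above: the class name `RunningStats` and its methods are hypothetical, made up for illustration. It also assumes the reported figures are population standard deviations (dividing by n, not n - 1); if yours are sample standard deviations, the conversion in `from_summary` would need adjusting. That helper rebuilds the three raw totals from a (count, mean, stddev) row, so it may also be usable when only the summaries are available.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical helper: tracks count, sum, and sum of squares."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        """Fold a single observation into the totals."""
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two distributions by adding their raw totals."""
        return RunningStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @classmethod
    def from_summary(cls, count: int, mean: float, stddev: float) -> "RunningStats":
        """Rebuild the totals from a summary row.

        Assumes stddev is the population standard deviation, so that
        sum(x^2) = count * (stddev^2 + mean^2).
        """
        return cls(count, count * mean, count * (stddev ** 2 + mean ** 2))

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def stddev(self) -> float:
        """Population standard deviation: sqrt(E[x^2] - E[x]^2)."""
        variance = self.total_sq / self.count - self.mean ** 2
        # Rounding error can push a tiny variance slightly below zero.
        return math.sqrt(max(variance, 0.0))


# The three rows from the question as (count, mean, stddev).
rows = [
    (1_027_671, 57.8, 32.79),
    (839_473, 61.3, 48.53),
    (3_012_102, 64.9, 41.92),
]
combined = RunningStats()
for n, m, s in rows:
    combined = combined.merge(RunningStats.from_summary(n, m, s))
print(combined.count, round(combined.mean, 1), round(combined.stddev, 2))
```

One caveat on the design: subtracting the squared mean from the mean of squares can lose precision when the mean is large relative to the spread, so for data like that a more numerically careful update may be preferable.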
Upvotes: 3
Reputation: 58594
I think it is impossible to calculate this exactly from the data you have. The problem is that the standard deviation depends on the mean of the combined data set, which isn't necessarily the same as the individual means, and also on the distance of each point from that mean, to which you have no exact (though perhaps approximate) access.
Upvotes: 0