Reputation: 5739
If I have a big list, numpy array, or similar that I need to split into sub-lists, how can I efficiently calculate the statistics (mean, standard deviation, etc.) for the whole list?
As a simple example, let's say that I have this small list:
l = [2,1,4,1,2,1,3,2,1,5]
>>> mean(l)
2.2000000000000002
But, if for some reason I need to split into sub-lists:
l1 = [2,1,4,1]
l2 = [2,1,3,2]
l3 = [1,5]
Of course, you don't need to know a lot about mathematics to know that this is NOT TRUE:
mean(l) == mean([mean(l1), mean(l2), mean(l3)])
This holds only if all the sub-lists have the same length, which is not the case here.
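To make the mismatch concrete, here is a minimal sketch. It assumes mean comes from the standard-library statistics module (the snippet above does not say where it is imported from), and the names chunks, mean_of_means and weighted_mean are only illustrative. It also shows that weighting each sub-list mean by its length recovers the mean of the whole list.

```python
from statistics import mean

l = [2, 1, 4, 1, 2, 1, 3, 2, 1, 5]
chunks = [[2, 1, 4, 1], [2, 1, 3, 2], [1, 5]]

# Naive approach: every sub-list counts equally, regardless of its length.
mean_of_means = mean([mean(c) for c in chunks])

# Weighting each sub-list mean by its length gives back the overall mean.
total_count = sum(len(c) for c in chunks)
weighted_mean = sum(len(c) * mean(c) for c in chunks) / total_count

print(mean(l), mean_of_means, weighted_mean)
```

The short sub-list [1, 5] has a mean of 3 but only two elements, so the unweighted average gives it too much influence.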
The background of this question is the case where you have a very big dataset that does not fit into memory, so you need to split it into chunks.
Upvotes: 1
Views: 157
Reputation: 17576
In general, you need to keep the so-called sufficient statistics for each subset. For the mean and standard deviation, the sufficient statistics are the number of data points, their sum, and their sum of squares. Given those 3 quantities for each subset, you can compute the mean and standard deviation for the whole set.
The sufficient statistics are not necessarily any smaller than the subset itself. But for mean and standard deviation, the sufficient statistics are just a few numbers.
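As a rough sketch of this idea, assuming plain Python lists as chunks and the population (divide by n) form of the standard deviation: the helper names chunk_stats and combine_stats are made up for illustration and are not part of any library.

```python
import math

def chunk_stats(chunk):
    """Return (count, sum, sum of squares) for one chunk."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine_stats(stats):
    """Combine per-chunk (count, sum, sum of squares) into mean and std."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    # Population variance: E[x^2] - (E[x])^2, clamped at 0 against rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)

chunks = [[2, 1, 4, 1], [2, 1, 3, 2], [1, 5]]
overall_mean, overall_std = combine_stats([chunk_stats(c) for c in chunks])
print(overall_mean, overall_std)
```

Only three numbers per chunk need to be kept in memory. One caveat: the sum-of-squares formula can lose precision when the mean is large relative to the spread of the data, so for such data a more numerically careful combination scheme may be preferable.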
Upvotes: 2
Reputation: 2268
I assume you know the number of data points you have, i.e., len(l)? Then you could just calculate the sum of each list individually (i.e., map-reduce style) or keep a running sum (e.g., if you are reading the data with readline()), and then divide by len(l) at the very end?
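A minimal sketch of the running-sum variant, assuming the data arrives one number per line. The function name streaming_mean is hypothetical, and io.StringIO stands in for a real file here. Counting the values while reading means the total length does not have to be known up front.

```python
import io

def streaming_mean(lines):
    """Compute the mean of one-number-per-line input without storing it."""
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += float(line)
        count += 1
    if count == 0:
        raise ValueError("no data points found")
    return total / count

# Stand-in for a large file on disk (assumption: one value per line).
fake_file = io.StringIO("2\n1\n4\n1\n2\n1\n3\n2\n1\n5\n")
result = streaming_mean(fake_file)
print(result)
```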
Upvotes: 0