codeKiller

Reputation: 5739

Calculating the mean of a list split into sub-lists

If I have a big list, numpy array, or similar that I need to split into sub-lists, how can I efficiently calculate the statistics (mean, standard deviation, etc.) for the whole list?

As a simple example, let's say that I have this small list:

l = [2,1,4,1,2,1,3,2,1,5]
>>> mean(l)
2.2000000000000002

But if, for some reason, I need to split it into sub-lists:

l1 = [2,1,4,1]
l2 = [2,1,3,2]
l3 = [1,5]

Of course, you don't need to know a lot about mathematics to know that this is NOT TRUE:

mean(l) = mean(mean(l1), mean(l2), mean(l3))

This would hold if all the sub-lists had the same length, which is not the case here.
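To make the mismatch concrete: the sub-list means here are 2, 2 and 3, so their plain average is 7/3 (about 2.33) rather than 2.2. Weighting each sub-list mean by its length recovers the overall mean. A minimal sketch, assuming `statistics.mean` from the standard library as a stand-in for whichever `mean` was used above:

```python
from statistics import mean

l1 = [2, 1, 4, 1]
l2 = [2, 1, 3, 2]
l3 = [1, 5]
chunks = [l1, l2, l3]

# Plain average of the sub-list means: off when the lengths differ.
mean_of_means = mean(mean(c) for c in chunks)

# Length-weighted average of the sub-list means: matches the overall mean.
total_n = sum(len(c) for c in chunks)
weighted_mean = sum(len(c) * mean(c) for c in chunks) / total_n
```

Here `weighted_mean` agrees with the mean of the full list, while `mean_of_means` does not.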

The background of this question is the case where you have a very big dataset that does not fit into memory, so you need to split it into chunks.

Upvotes: 1

Views: 157

Answers (2)

Robert Dodier

Reputation: 17576

In general, you need to keep the so-called sufficient statistics for each subset. For the mean and standard deviation, the sufficient statistics are the number of data points, their sum, and their sum of squares. Given those 3 quantities for each subset, you can compute the mean and standard deviation for the whole set.

The sufficient statistics are not necessarily any smaller than the subset itself. But for mean and standard deviation, the sufficient statistics are just a few numbers.
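As a rough sketch of the idea above, each chunk can be reduced to a (count, sum, sum of squares) triple, and the triples can then be added up. The helper names `summarize` and `combine` are hypothetical, and the standard deviation computed here is the population one (dividing by n, which is also the default of `numpy.std`):

```python
import math

def summarize(chunk):
    """Return the sufficient statistics (count, sum, sum of squares) of one chunk."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return n, s, ss

def combine(stats):
    """Merge per-chunk (count, sum, sum of squares) triples into overall mean and std."""
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    ss = sum(t[2] for t in stats)
    mean = s / n
    # Population variance: E[x^2] - (E[x])^2; clamp tiny negative rounding errors.
    var = max(ss / n - mean * mean, 0.0)
    return mean, math.sqrt(var)

chunks = [[2, 1, 4, 1], [2, 1, 3, 2], [1, 5]]
overall_mean, overall_std = combine([summarize(c) for c in chunks])
```

Only three numbers per chunk need to stay in memory. One caveat: the sum-of-squares formula can lose precision when the values are large compared with their spread, so for such data a more careful update scheme may be worth considering.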

Upvotes: 2

jonnybazookatone

Reputation: 2268

I assume you know the number of data points you have, i.e., len(l)? Then you could just calculate the sum of each list individually (i.e., map-reduce) or keep a running sum (e.g., if you are reading the data with readline()), and then divide by len(l) at the very end.
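A minimal sketch of the running-sum variant, assuming (hypothetically) that the data arrives as one number per line. It also keeps a running count, so the total length does not have to be known in advance; `streaming_mean` is an illustrative name:

```python
import io

def streaming_mean(lines):
    """Mean of one number per line, keeping only a running sum and a count."""
    total = 0.0
    count = 0
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        total += float(line)
        count += 1
    if count == 0:
        raise ValueError("no data")
    return total / count

# io.StringIO stands in for a real file object opened with open(...).
fake_file = io.StringIO("2\n1\n4\n1\n2\n1\n3\n2\n1\n5\n")
result = streaming_mean(fake_file)
```

Iterating over a file object reads it line by line, so the whole dataset never has to be in memory at once.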

Upvotes: 0
