Martin Drozdik

Reputation: 13313

How to compute mean (average) robustly?

If we compute the mean naively:

#include <numeric>   // std::accumulate
#include <vector>

std::vector<double> values;
double sum = std::accumulate(begin(values), end(values), 0.0);
double mean = sum / values.size();

and values.size() is big, we can get inaccurate results, since floating-point numbers have less resolution at higher magnitudes. Or worse, if I understand it correctly, the running sum can overflow to an infinite result.

When we have an even number of values, we can compute the mean of the first half, then the mean of the second half, and then take the mean of these two means.
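For example, something like this sketch (mean_of_halves is just an illustrative name; it assumes values.size() is even and non-zero):

#include <cstddef>
#include <numeric>
#include <vector>

// Average each half separately, then average the two partial means.
double mean_of_halves(const std::vector<double>& values)
{
    const std::size_t half = values.size() / 2;
    const double m1 = std::accumulate(values.begin(), values.begin() + half, 0.0)
                      / static_cast<double>(half);
    const double m2 = std::accumulate(values.begin() + half, values.end(), 0.0)
                      / static_cast<double>(values.size() - half);
    return 0.5 * (m1 + m2);
}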

This doesn't seem to be a new problem, but I have trouble finding resources. I think there are more sophisticated techniques with trade-offs in speed and accuracy, and I wonder if someone has summarized them somewhere, or even better, whether they are available in some library.

Upvotes: 10

Views: 1870

Answers (5)

thus spake a.k.

Reputation: 1637

If you're willing to muck about with values in the process, a simple and robust scheme is to first sort them by magnitude:

#include <algorithm>   // std::sort
#include <cmath>       // std::fabs
#include <numeric>     // std::accumulate

// Orders elements by absolute magnitude, smallest first.
struct fabs_less
{
    bool operator()(const double x0, const double x1) const
    {
        return std::fabs(x0) < std::fabs(x1);
    }
};

std::sort(values.begin(), values.end(), fabs_less());
const double sum = std::accumulate(values.begin(), values.end(), 0.0);
const double mean = sum / double(values.size());

This increases the computational complexity to N log N but results in the minimum possible rounding error.

Edit: tmyklebu makes a very good point with a degenerate case (curses that I missed it). Instead, accumulate the negative and positive terms separately, each in order of increasing magnitude:

// Sort ascending: most negative first, then zeros, then positives.
std::sort(values.begin(), values.end());
// mid points at the first strictly positive element; everything before it is <= 0.
std::vector<double>::iterator mid = std::upper_bound(values.begin(), values.end(), 0.0);
// Walking backwards from mid visits the non-positive elements in order of increasing magnitude.
std::vector<double>::reverse_iterator rmid(mid);
const double neg = std::accumulate(rmid, values.rend(), 0.0);
const double pos = std::accumulate(mid, values.end(), 0.0);
const double mean = (neg + pos) / double(values.size());

This introduces the possibility of cancellation error in neg + pos, but the result will still have a small error relative to the sum of the absolute values of the elements of values, which I think is the best you can hope for without some seriously complicated logic...

Upvotes: 4

aka.nice

Reputation: 9382

Generally, a divide and conquer technique (a recursive split in two parts) is robust.

See my answer to Precise sum of floating point numbers, where I demonstrate it with a recursive form.

Note that tail-call elimination is not guaranteed in C/C++, so this implementation is not necessarily efficient (it can lead to a deep stack).
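For reference, a minimal sketch of that divide-and-conquer (pairwise) summation, with names of my own choosing:

#include <cstddef>
#include <vector>

// Recursively split the range in two halves, sum each half, then add the partial sums.
double pairwise_sum(const double* data, std::size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return data[0];
    const std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}

double pairwise_mean(const std::vector<double>& values)
{
    return pairwise_sum(values.data(), values.size()) / static_cast<double>(values.size());
}

In this particular form the recursion depth grows only logarithmically with N; the implementation in the linked answer may be structured differently.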

Upvotes: 2

user3344003

Reputation: 21627

Pardon, not doing this as a comment due to length. A double value usually has more than 50 bits of precision. You're talking about 1 part in a trillion or more.

The resolution of a floating point number remains the same on a fractional basis throughout its range.

But if you add 1234E40 to 1234E-40, you're going to get 1234E40. Adding values of different orders of magnitude will throw an average off. However, the amount it will be off is usually so small (trillionths) that it is rarely noticeable.
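For instance, this little sketch (just an illustration of the absorption effect described above) passes its assertion:

#include <cassert>

int main()
{
    const double big  = 1234e40;
    const double tiny = 1234e-40;
    // tiny is far below big's least significant bit, so it is absorbed entirely.
    assert(big + tiny == big);
}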

In nearly all cases, you can do an average simply by adding and dividing by the count and get a very precise answer.

You might even be able to use a long double on your system.

If you have some data set where this is not the case, maybe you can describe that data set and the problems it presents. From that, we could come up with a solution to your particular problem.

Upvotes: 1

tmyklebu

Reputation: 14205

Lots of stupid things can happen here. One problem is the overflow thing. Another is exemplified by this: ((1e100 + 1) - 1e100) == 0. Another is just accumulated roundoff.

Kahan summation handles accumulated roundoff very well for well-scaled data. Find the sum using Kahan summation, then divide by the number of data points.
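A minimal sketch of that approach (kahan_mean is my own name; note that aggressive floating-point optimisation flags such as -ffast-math can defeat the compensation):

#include <cstddef>
#include <vector>

double kahan_mean(const std::vector<double>& values)
{
    double sum = 0.0;
    double c = 0.0;                  // running compensation for lost low-order bits
    for (double v : values)
    {
        const double y = v - c;      // subtract the error carried from the last step
        const double t = sum + y;    // low-order bits of y may be lost here
        c = (t - sum) - y;           // recover what was lost
        sum = t;
    }
    return sum / static_cast<double>(values.size());
}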

To deal with poorly-scaled data, you might bucket the data by exponent (say 50 different buckets each covering about 20 different exponents) and Kahan-sum in descending bucket order.
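A rough sketch of that idea, with details that are my own guesses (here 50 buckets span the whole double exponent range, so each bucket covers about 44 exponents rather than 20):

#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

double bucketed_kahan_mean(const std::vector<double>& values)
{
    // Group the data by binary exponent.
    std::array<std::vector<double>, 50> buckets;
    for (double v : values)
    {
        int e = 0;
        std::frexp(v, &e);                 // binary exponent of v
        int idx = (e + 1100) / 44;         // map the double exponent range to [0, 49]
        if (idx < 0)  idx = 0;
        if (idx > 49) idx = 49;
        buckets[static_cast<std::size_t>(idx)].push_back(v);
    }

    // Kahan-sum the buckets in descending order of magnitude.
    double sum = 0.0, c = 0.0;
    for (int b = 49; b >= 0; --b)
    {
        for (double v : buckets[static_cast<std::size_t>(b)])
        {
            const double y = v - c;
            const double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
    }
    return sum / static_cast<double>(values.size());
}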

This is all massive overkill, of course, and it's rather slow. In practice, using vector instructions and stuff like that helps with speed and with precision.

Upvotes: 9

Juan Lopes

Reputation: 10565

You can use an online algorithm as described here.

Basically (in pythonish pseudo-code):

n = 0
mean = 0.0

for value in data:                 # data: the input sequence of samples
    n += 1
    mean += (value - mean) / n     # running-mean update; no large sum is ever formed

This algorithm is more numerically stable than the naïve implementation.
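Roughly the same thing in C++ (running_mean is just an illustrative name):

#include <cstddef>
#include <vector>

// After each sample, mean holds the mean of the samples seen so far.
double running_mean(const std::vector<double>& data)
{
    double mean = 0.0;
    std::size_t n = 0;
    for (double value : data)
    {
        ++n;
        mean += (value - mean) / static_cast<double>(n);
    }
    return mean;
}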

Upvotes: 9
