hamster on wheels
hamster on wheels

Reputation: 2893

Reduce rounding error of sum of square of floats

I try to calculate the sum of square of an array of float.

How to reduce the rounding error?

I am trying to sum about 5,000,000 floats in the inner loop of my actual program.

test.cpp:

#include <iostream>
#include <stdint.h>
template <typename Sum, typename Element>
Sum sum(const size_t st, const size_t en) {
    Sum s = 0;
    for (size_t i = st; i < en; ++ i) {
        s += Element(i)*Element(i); 
    }
    return s;
}
int main() {
    size_t size = 100000;
    std::cout << "double, float: " 
              << sum<double, float>(0,size) << "\n";
    std::cout << "int, int: " 
              << sum<int, int>(0,size) << "\n";
}

Output:

double, float: 3.33328e+14
int, int: 216474736

Upvotes: 1

Views: 1253

Answers (4)

infiniteLoop
infiniteLoop

Reputation: 381

When you use the type int for Element, there is an overflow on the square for each i after std::sqrt(std::numeric_limits<int>::max()), which could be 46341 on your system. There is also an overflow on the sum when it reaches std::numeric_limits<int>::max().

You could use the type long or long long instead of int to increase this number.

It is also a good idea to store or cast the first float into a double or long double before multiplying to reduce the error on the floating point square operation as well. Rounding on the last steps of a set of computations gives always a better result than rounding on the early steps, because you avoid to propagate (and increase) the representation errors on your internal computations.

If you really want precision, and do not want to reinvent the wheel using some complicated techniques, you can use a multi-precision library like GNU Multi-Precision Library or Boost Multiprecision : https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software

They are more precise that the long double type of your system

Upvotes: 2

Peter
Peter

Reputation: 36617

If all you want to do is add up squares of consecutive values, use the formula n*(n+1)*(2n+1)/6to calculate the sum of the squares of all values from 1 to n.

That eliminates most effects of rounding as long as you use a type that can represent the result. For example;

 template<typename Sum> Sum sumsq(size_t n)
 {
     // calculates sum of squares from 1 to x
     //   assumes size_t can be promoted to a Sum

     Sum temp(n);     // force promotion to type Sum
     return temp * (temp + 1)* (2*temp + 1)/6;
 }

 template<typename Sum> Sum alternate_sum(size_t st, size_t en)
 {
        Sum result = sumsq(en - 1);
        if (st > 0) result -= sumsq(st-1);
        return result;
 }

 int main()
 {
     size_t size = 100000;
     std::cout << "double, float: " 
              << alternate_sum<double>(0,size) << "\n";
     std::cout << "int, int: " 
          << alternate_sum<long long>(0,size) << "\n";
 }

Note that, for size equal to 100000, using int to hold the result gives undefined behaviour (overflow of a signed integral type).

The -1s in the alternate_sum() reflect your loops being of the form for (size_t i = st; i < en; ++ i)

You could eliminate the use of the size_t type as a fixed feature, but I'll leave that as an exercise.

BTW: since you say this code is in an inner loop, it's worth noting that this formula will be significantly faster than the loops you have been using.

Upvotes: 2

user1196549
user1196549

Reputation:

A float has 24 significant bits, while a double has 53 of them. So you have 29 guard bits, which is about 100 times more than 5000000.

So rounding errors will only arise for values which are 100 times smaller than the largest.

Also note that in the Intel architecture, the floating-point registers actually hold extended precision numbers on 80 bits, of which 63 are significant.

Then only numbers smaller than 100000 times the largest will incur truncation.

Should you really worry ?

Upvotes: 2

rcgldr
rcgldr

Reputation: 28921

If the format of the floats is known, such as IEEE, then an array indexed by the exponent of a float can be used to store partial sums, then summed up to produce the total sum. During the array update, only floats with the same exponent are added together and stored into the array in the appropriate location. The final summation goes from smallest to largest. For C++, the array and functions could be members of a class.

Example for floats where the array is passed as a parameter to the functions:

/* clear array */
void clearsum(float asum[256])
{
size_t i;
    for(i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
size_t i;
    while(1){
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
        if(i == 0xff){          /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if(asum[i] == 0.f){     /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];           /* else add slot to f, clear slot */
        asum[i] = 0.f;          /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
float sum = 0.f;
size_t i;
    for(i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}

Upvotes: 2

Related Questions