Reputation: 2893
I try to calculate the sum of square of an array of float.
How to reduce the rounding error?
I am trying to sum about 5,000,000 floats in the inner loop of my actual program.
test.cpp:
#include <iostream>
#include <stdint.h>
template <typename Sum, typename Element>
Sum sum(const size_t st, const size_t en) {
Sum s = 0;
for (size_t i = st; i < en; ++ i) {
s += Element(i)*Element(i);
}
return s;
}
int main() {
size_t size = 100000;
std::cout << "double, float: "
<< sum<double, float>(0,size) << "\n";
std::cout << "int, int: "
<< sum<int, int>(0,size) << "\n";
}
Output:
double, float: 3.33328e+14
int, int: 216474736
Upvotes: 1
Views: 1253
Reputation: 381
When you use the type int
for Element
, there is an overflow on the square for each i
after std::sqrt(std::numeric_limits<int>::max())
, which could be 46341
on your system.
There is also an overflow on the sum when it reaches std::numeric_limits<int>::max()
.
You could use the type long
or long long
instead of int
to increase this number.
It is also a good idea to store or cast the first float
into a double
or long double
before multiplying to reduce the error on the floating point square operation as well.
Rounding on the last steps of a set of computations gives always a better result than rounding on the early steps, because you avoid to propagate (and increase) the representation errors on your internal computations.
If you really want precision, and do not want to reinvent the wheel using some complicated techniques, you can use a multi-precision library like GNU Multi-Precision Library
or Boost Multiprecision
:
https://en.wikipedia.org/wiki/List_of_arbitrary-precision_arithmetic_software
They are more precise that the long double
type of your system
Upvotes: 2
Reputation: 36617
If all you want to do is add up squares of consecutive values, use the formula n*(n+1)*(2n+1)/6
to calculate the sum of the squares of all values from 1
to n
.
That eliminates most effects of rounding as long as you use a type that can represent the result. For example;
template<typename Sum> Sum sumsq(size_t n)
{
// calculates sum of squares from 1 to x
// assumes size_t can be promoted to a Sum
Sum temp(n); // force promotion to type Sum
return temp * (temp + 1)* (2*temp + 1)/6;
}
template<typename Sum> Sum alternate_sum(size_t st, size_t en)
{
Sum result = sumsq(en - 1);
if (st > 0) result -= sumsq(st-1);
return result;
}
int main()
{
size_t size = 100000;
std::cout << "double, float: "
<< alternate_sum<double>(0,size) << "\n";
std::cout << "int, int: "
<< alternate_sum<long long>(0,size) << "\n";
}
Note that, for size
equal to 100000
, using int
to hold the result gives undefined behaviour (overflow of a signed integral type).
The -1
s in the alternate_sum()
reflect your loops being of the form for (size_t i = st; i < en; ++ i)
You could eliminate the use of the size_t
type as a fixed feature, but I'll leave that as an exercise.
BTW: since you say this code is in an inner loop, it's worth noting that this formula will be significantly faster than the loops you have been using.
Upvotes: 2
Reputation:
A float has 24 significant bits, while a double has 53 of them. So you have 29 guard bits, which is about 100 times more than 5000000.
So rounding errors will only arise for values which are 100 times smaller than the largest.
Also note that in the Intel architecture, the floating-point registers actually hold extended precision numbers on 80 bits, of which 63 are significant.
Then only numbers smaller than 100000 times the largest will incur truncation.
Should you really worry ?
Upvotes: 2
Reputation: 28921
If the format of the floats is known, such as IEEE, then an array indexed by the exponent of a float can be used to store partial sums, then summed up to produce the total sum. During the array update, only floats with the same exponent are added together and stored into the array in the appropriate location. The final summation goes from smallest to largest. For C++, the array and functions could be members of a class.
Example for floats where the array is passed as a parameter to the functions:
/* clear array */
void clearsum(float asum[256])
{
size_t i;
for(i = 0; i < 256; i++)
asum[i] = 0.f;
}
/* add a number into array */
void addtosum(float f, float asum[256])
{
size_t i;
while(1){
/* i = exponent of f */
i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
if(i == 0xff){ /* max exponent, could be overflow */
asum[i] += f;
return;
}
if(asum[i] == 0.f){ /* if empty slot store f */
asum[i] = f;
return;
}
f += asum[i]; /* else add slot to f, clear slot */
asum[i] = 0.f; /* and continue until empty slot */
}
}
/* return sum from array */
float returnsum(float asum[256])
{
float sum = 0.f;
size_t i;
for(i = 0; i < 256; i++)
sum += asum[i];
return sum;
}
Upvotes: 2