RajeshDA

Reputation: 481

std::pow produce different result in 32 bit and 64 bit application

I have found a mismatch in the result of some complex calculations. When I examined the intermediate results, it was the std::pow function that created the mismatch. Below are the inputs/output.

long double dvalue = 2.7182818284589998;
long double dexp = -0.21074699576017999;
long double result = std::powl(dvalue, dexp);

64bit -> result = 0.80997896907296496 and 32bit -> result = 0.80997896907296507

I am using VS2008. I have tried other variations of the pow function that take long double and return long double, but I still see the same difference.

double pow( double base, double exponent );

long double powl( long double base, long double exponent );

I have read some info on this:

Intel x86 processors use 80-bit extended precision internally, whereas double is normally 64 bits wide. Different optimization levels affect how often floating-point values are saved from the CPU to memory and thus rounded from 80-bit precision to 64-bit precision. Alternatively, use the long double type, which is normally 80 bits wide on gcc, to avoid rounding from 80-bit to 64-bit precision.

Could someone help me clearly understand this difference and the ways to overcome it?

Upvotes: 7

Views: 2682

Answers (4)

Adrian McCarthy

Reputation: 47962

What's probably happening is that the 32-bit build uses the 80-bit FPU registers to do the calculation while the 64-bit build uses SSE2 SIMD instructions on 64-bit values, causing a slight discrepancy. Note that both answers agree to 14 decimal places, which is about the best you can hope for with 64-bit floating-point values.

Visual C++ offers compiler options that let you say whether you prefer speed, consistency, or precision with regard to floating point operations. Using those options (e.g., /fp:strict), you can probably get consistent values between the two builds if that's important to you.

Also note that VC++2008 is rather old. Newer versions have fixes for many bugs, including some related to floating point. (Popular implementations of strtod in open source software have had bugs detected and fixed since 2008.) In addition to the precision difference between 80-bit and 64-bit operations, you may also be encountering parsing and display bugs. Nonetheless, floating point is hard, and bugs persist.

Upvotes: 5

Cubbi

Reputation: 47428

You used literals of type double, not long double (you forgot the L suffix). This means that when you wrote 2.7182818284589998 (a value no double can hold exactly), the compiler had to choose between 2.718281828458999793696193592040799558162689208984375 and 2.71828182845899934960698374197818338871002197265625,

and when you wrote -0.21074699576017999 (another value no double can hold exactly), it had to choose between -0.2107469957601799948054832611887832172214984893798828125 and -0.210746995760179967049907645559869706630706787109375.

With default round-to-nearest, the values actually stored in dvalue and dexp were 2.718281828458999793696193592040799558162689208984375 and -0.2107469957601799948054832611887832172214984893798828125 (storing a double in a long double does not change its value).

The result of pow should be close to 0.8099789690729650165287354526069381795064774873497553965297999359066924950079080502973738475702702999114990234375, which then has to be placed in the return type; in your case that should be long double (except that MSVC doesn't distinguish it from double, as far as I recall and as your results show).

Placing the result in a 64-bit double, we have to choose between 0.80997896907296496049610823320108465850353240966796875 and 0.80997896907296507151841069571673870086669921875.

The correct answer (rounding to nearest) is 0.80997896907296507151841069571673870086669921875 and that is exactly what you got in the "32bit result", truncated as 0.80997896907296507.

Your "64bit result" appears to be exactly the other 64-bit double value, rounded the wrong way from the correct result (and truncated as 0.80997896907296496). I would consider that a quality-of-implementation bug: gcc, clang, Intel, and Oracle all give the single correctly rounded result (even though they don't have to: IEEE precision requirements for pow allow more than 0.5 ulp of error).

Incidentally, if your pow returned an Intel 80-bit long double, the result would have to fall between 0.8099789690729650164951504420773886749884695746004581451416015625 and 0.809978969072965016549360550701663896688842214643955230712890625, with the latter being the nearest.

Upvotes: 2

sbabbi

Reputation: 11181

From the Wikipedia page on long double:

On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware [...]. An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.

So when you compile for 32-bit, long double = double, but on x64 long double is actually an 80-bit floating-point type, so the results are different.

Upvotes: 0

user1084944

Reputation:

The most important thing to understand about floating-point calculations is that they are (almost always) inexact. Most numbers cannot be represented exactly as a floating-point number. And even when the result of a calculation can be represented exactly, the result actually computed may still not be exactly correct.

The way to deal with this is to write code that does not depend on getting exact results. For example, you should almost never test floating-point numbers for equality. Or, if you need to test whether a number is positive, your program may need to reject extremely small positive numbers (they're approximately negative) or accept extremely small negative numbers (they're approximately positive).

Similarly, you should try to avoid numerically unstable algorithms, since these small errors blow up quickly; conversely, you should try to use numerically stable algorithms, since those are error tolerant.

How to do numerical calculations well is an entire field of study!

Upvotes: 2
