artificial_lunacy

Reputation: 23

C++ floating point type that doesn't use scientific notation to represent very small values?

I'm writing an algorithm that deals with logit-normal distribution transformations, which have the characteristic that, with high enough sigmas/means, the values can be "compressed" very tightly towards the edges at 0 and 1. I'm running into the problem that values which should be different collide because a C++ double can't represent enough digits. For example, when I apply the sigmoid/logistic function to two large but different values, the outcomes are identical, which wouldn't happen if doubles could hold more precision.

I can deal with that by aggregating samples that collide onto the same value. The issue is that the collisions only happen around 1, where numbers are represented like 0.999999999953. Around 0 they don't happen, because C++ starts to represent values in scientific notation, like 0.12949384e-300. The latter achieves much higher precision, so no collision happens on that side. But I need symmetrical behavior around 0 and 1 to make sure the result is computed the same way on both sides.
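
Roughly, a minimal example of what I mean (the inputs are arbitrary, just large enough to trigger the collision):

#include <cmath>
#include <cstdio>

int main() {
    // Plain logistic function: sigma(x) = 1 / (1 + exp(-x)).
    auto sigma = [](double x) { return 1.0 / (1.0 + std::exp(-x)); };

    // Near 1 the results collide: both lines print exactly 1.
    std::printf("sigma(40)  = %.17g\n", sigma(40.0));
    std::printf("sigma(41)  = %.17g\n", sigma(41.0));

    // Near 0 they stay distinct, because tiny doubles still carry
    // their own ~16 significant digits.
    std::printf("sigma(-40) = %.17g\n", sigma(-40.0));
    std::printf("sigma(-41) = %.17g\n", sigma(-41.0));
}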

Is there such a type in C++ that does not automatically switch to scientific notation when representing extremely small values? Then I could achieve the same behavior around 0 that I get around 1.

Upvotes: -1

Views: 208

Answers (2)

phuclv

Reputation: 42002

Is there such a type in C++ that does not automatically switch to scientific notation when representing extremely small values?

There's no such thing as "automatically switching to scientific notation when representing extremely small values". The output format is specified by the programmer. In C++ the output format can be controlled by input/output manipulators, std::format, or printf format specifiers. For example you can use std::fixed << std::setprecision(800) to tell std::cout not to use scientific notation for very small or very large values and to print 800 digits of precision.
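
For instance, a minimal sketch of forcing plain decimal output for a tiny value (the literal is just the one from the question):

#include <iomanip>
#include <iostream>

int main() {
    double tiny = 1.2949384e-300;
    // The default stream format prints such a value in scientific notation...
    std::cout << tiny << '\n';                       // 1.29494e-300
    // ...but std::fixed forces a plain decimal expansion. Only the textual
    // representation changes; the stored value is exactly the same double.
    std::cout << std::fixed << std::setprecision(320) << tiny << '\n';
}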

That means saying "C++ starts to represent values in scientific notation" is wrong. It's just a consequence of the output format being used, for example %e or %g in printf. However, changing the output format won't help, because the real problem you're having is a lack of precision and/or range.

C++ floating-point types always have a fixed format and fixed precision. They're stored internally in the form (-1)^sign × M × base^N, which can be seen as similar to scientific notation when base = 10. Due to the nature of that formula, the representable numbers are not uniformly spaced; the difference between two consecutive representable numbers varies with the magnitude. Specifically for IEEE-754 (where normal numbers get an implied leading 1 bit in the significand), the absolute spacing shrinks with the magnitude, so doubles near 0 are packed far more densely than doubles near 1, which is exactly why you see collisions around 1 but not around 0; only in the denormal range does the spacing become fixed. However, if you go into the denormal zone you'll have serious performance issues and you'll also gradually lose precision towards the bottom of the range.
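
You can see the non-uniform spacing directly with std::nextafter; a small sketch (the sample values are arbitrary):

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // Gap between a value and the next representable double above it:
    double gap_at_one  = std::nextafter(1.0, 2.0) - 1.0;            // ~2.2e-16
    double gap_at_tiny = std::nextafter(1e-300, 1.0) - 1e-300;      // ~1.7e-316
    double denorm_min  = std::numeric_limits<double>::denorm_min(); // ~4.9e-324
    std::printf("gap at 1.0       : %g\n", gap_at_one);
    std::printf("gap at 1e-300    : %g\n", gap_at_tiny);
    std::printf("smallest denormal: %g\n", denorm_min);
}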


If you choose to stick with plain double, one way to mitigate this is to rearrange the formula and use techniques and functions that are designed for cases like this:

  • For exponentials and logarithms use std::log1p and std::expm1 instead of std::log and std::exp. You'll need to split the function into variants for values near 1 and values far from it, and use the appropriate one to minimize error (see the sketch after this list). Also consider std::fma for multiply-add/subtract operations. There are other functions that improve precision in specific use cases, like std::cbrt, std::hypot and std::exp2, although I'm not sure whether they help you here.
  • For summation, add from the smallest value to the largest, or use Kahan summation.
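
As a rough sketch of the first point applied to the logistic function from the question; the names sigmoid, sigmoid_complement and log_sigmoid are just made up for illustration:

#include <cmath>
#include <cstdio>

// sigma(x) = 1 / (1 + exp(-x)), computed without overflowing exp for large |x|.
double sigmoid(double x) {
    if (x >= 0.0)
        return 1.0 / (1.0 + std::exp(-x));
    double e = std::exp(x);          // e <= 1, cannot overflow
    return e / (1.0 + e);
}

// 1 - sigma(x) computed directly via the exact identity 1 - sigma(x) == sigma(-x).
// Near x -> +inf this keeps full relative precision, whereas 1.0 - sigmoid(x)
// collapses to 0 or keeps only a few significant bits.
double sigmoid_complement(double x) {
    return sigmoid(-x);
}

// log(sigma(x)) via log1p, for when even sigma(-x) would underflow:
// log(1/(1+exp(-x))) = -log1p(exp(-x)), split so exp's argument stays <= 0.
double log_sigmoid(double x) {
    return x >= 0.0 ? -std::log1p(std::exp(-x))
                    : x - std::log1p(std::exp(x));
}

int main() {
    double x = 40.0;
    std::printf("naive 1 - sigmoid(x) : %.17g\n", 1.0 - sigmoid(x));      // 0, digits lost
    std::printf("sigmoid_complement(x): %.17g\n", sigmoid_complement(x)); // ~4.25e-18
    std::printf("log_sigmoid(x)       : %.17g\n", log_sigmoid(x));
}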

To get higher precision and range (because you need values very close to 0) you have to use a custom type. The most commonly used formats take the same form as above, but with a larger significand and/or exponent.

There are also some other floating-point formats, but as of now they're not hardware-accelerated, so for your use case fixed-precision or double-double arithmetic is the best choice.
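
If the double-double route interests you, a minimal sketch of the core building block (Knuth's error-free TwoSum); the dd struct and helper names are only illustrative, and a real library would provide properly tuned versions:

#include <cstdio>

struct dd { double hi, lo; };   // value = hi + lo, with |lo| much smaller than |hi|

// Error-free transformation: s + err is exactly a + b.
static dd two_sum(double a, double b) {
    double s   = a + b;
    double bb  = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return {s, err};
}

// Add a plain double to a double-double and renormalize the result.
static dd dd_add(dd a, double b) {
    dd s = two_sum(a.hi, b);
    s.lo += a.lo;
    return two_sum(s.hi, s.lo);
}

int main() {
    // 1 + 1e-20 is not representable in a single double, but the pair keeps it.
    dd x = dd_add({1.0, 0.0}, 1e-20);
    std::printf("hi = %g  lo = %g\n", x.hi, x.lo);   // hi = 1, lo ~ 1e-20
}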

Upvotes: 0

Henrique Bucher

Reputation: 4474

You can use the type long double. On x86 it is typically the 80-bit extended-precision format, which is the original precision of the 8087 coprocessor.

Note that Microsoft dropped support for it long ago; on MSVC, long double is just another name for double.

GCC supports the type __float128, which implements IEEE 754 quadruple precision. This is nice since, if you use Linux, it's probably already there.

Boost.Multiprecision offers a 128-bit cross-platform facade for the above. However, that brings a Boost dependency into your application, which can be daunting in certain scenarios. UPDATE: Boost.Multiprecision's float128 does not compile on MSVC/Windows.

Example:

#include <iostream>
#include <limits>
#include <cmath>
#include "boost/format.hpp"
#include <boost/multiprecision/float128.hpp>

namespace mp = boost::multiprecision;

// 1/3 + 1/2 + 1/6 == 1 exactly, so any nonzero result is accumulated rounding error.
template< typename T >
[[gnu::noinline]] T calc(T a,T b,T c) {
    return T(1) - (T(1)/a + T(1)/b + T(1)/c );
}

int main() {
    std::cout << "     Epsilon        Error"<< std::endl;
    boost::format fmt( "%12.6g %12.6g" );
    std::cout << fmt % std::numeric_limits<double>::epsilon() % calc<double>(3,2,6) << std::endl;
    std::cout << fmt % std::numeric_limits<long double>::epsilon() % calc<long double>(3,2,6) << std::endl;
    std::cout << fmt % double(std::numeric_limits<__float128>::epsilon()) % double(calc<__float128>(3,2,6)) << std::endl;
    std::cout << fmt % std::numeric_limits<mp::float128>::epsilon() % calc<mp::float128>(3,2,6) << std::endl;
}

Results in

Program stdout
     Epsilon        Error
 2.22045e-16  1.11022e-16
  1.0842e-19            0
           0  9.62965e-35
 1.92593e-34  9.62965e-35

Compiler explorer link

Note, however, that there is a huge performance penalty for 128-bit precision. I wrote a benchmark using the above formula and the results are:

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
Calc<double>           375575 ns       375514 ns         1864
Calc<long double>      531313 ns       531228 ns         1318
Calc<__float128>     12943995 ns     12941473 ns           54
Calc<mp::float128>   13074108 ns     13071816 ns           54

Upvotes: 1
