TStancek

Reputation: 318

Convert large int to float without rounding c++

Is there a fast and clean way to convert an int32_t (or larger) to the largest value representable in float that is not larger than the original integer value?

According to the IEEE 754 standard (which I have only read about on Wikipedia: https://en.wikipedia.org/wiki/Single-precision_floating-point_format), a large integer is converted by rounding it to the nearest multiple of a power of 2, where the power depends on the magnitude of the value.
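To make the "multiple of a power of 2" concrete, here is a minimal sketch that measures the gap between a float and the next float above it. The helper name spacing_above is hypothetical and only for illustration; it assumes float is the IEEE 754 single-precision format.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from f to the next representable float above it.
// For IEEE 754 single precision this gap is a power of 2 that grows with |f|.
inline float spacing_above(float f) {
    return std::nextafter(f, std::numeric_limits<float>::infinity()) - f;
}
```

For values in [2^30, 2^31) the gap is 128, so an int32_t close to INT32_MAX can be off by up to 64 after a round-to-nearest conversion.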

However, I would like to know whether it is possible to convert to the "largest float not larger" instead, and to do so cleanly without complicated constructs, ideally by setting some flag or by using some built-in instruction(s).

EDIT: I have a value x_int stored in an int32_t or int64_t, and I want to convert it to a float value x_float such that, for those values (mathematically, not in a programming language),

x_int>=x_float

is always true. A possible workaround for int32_t is to use double, which represents every int32_t value exactly, but I am not sure what to do for int64_t.
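One way to get x_int >= x_float for int32_t without touching any rounding flag is sketched below. It converts normally, compares against the original in double (exact for int32_t), and steps one float down if the conversion rounded up. The name floor_to_float is hypothetical, and the sketch assumes IEEE 754 float and double; it does not cover int64_t.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: largest float that is <= value.
inline float floor_to_float(std::int32_t value) {
    // Ordinary conversion; the result is one of the two floats adjacent to value.
    float f = static_cast<float>(value);
    // Both operands are exact in double, so this comparison is exact.
    if (static_cast<double>(f) > static_cast<double>(value)) {
        // The conversion rounded up: take the next float toward -infinity.
        f = std::nextafter(f, -std::numeric_limits<float>::infinity());
    }
    return f;
}
```

For example, floor_to_float(2147483584) yields 2147483520.0f, whereas a plain cast in the default rounding mode yields 2147483648.0f.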

Upvotes: 1

Views: 350

Answers (1)

Ben

Reputation: 35663

The behaviour may depend on the compiler options in force. For example, in MSVC, /fp:fast sacrifices correctness for speed; if that is not what you want, specify /fp:strict or /fp:precise (the default). On Clang, -menable-unsafe-fp-math does something similar.

The floating-point rounding mode is set with fesetround (declared in <cfenv>).

Retrieve the current rounding mode with fegetround so that you can restore it later. Then use fesetround to set the mode you want (in your case FE_TOWARDZERO if you mean smallest in magnitude, or FE_DOWNWARD otherwise) and cast the value to float. Finally, restore the previous rounding mode.

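#include <cfenv>    // std::fegetround, std::fesetround, FE_* macros
#include <cstdint>  // std::int32_t

// Convert value to float using the given rounding mode (e.g. FE_DOWNWARD).
inline float cast_with_mode(std::int32_t value, int mode){
    int prevmode = std::fegetround();
    if(prevmode == mode) return (float)value; // may be faster without this
    std::fesetround(mode);
    float result = (float)value;  // conversion uses the mode just set
    std::fesetround(prevmode);    // restore the caller's rounding mode
    return result;
}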

Performance-wise, comparing prevmode to mode may or may not help. If the mode is already correct, you need neither set it nor restore it, but I don't know whether the comparison is faster or slower than the set/restore pair.
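The same save/set/restore pattern can be wrapped in a small RAII guard, which also extends naturally to int64_t. This is only a sketch: ScopedRounding and floor_cast64 are hypothetical names, and the volatile temporaries are my assumption for discouraging the compiler from constant-folding the conversion or moving it across the fesetround calls. Some compilers may additionally need an option such as GCC's -frounding-math.

```cpp
#include <cfenv>
#include <cstdint>

// Hypothetical guard: sets a rounding mode and restores the old one on scope exit.
class ScopedRounding {
public:
    explicit ScopedRounding(int mode) : previous_(std::fegetround()) {
        std::fesetround(mode);
    }
    ~ScopedRounding() { std::fesetround(previous_); }
    ScopedRounding(const ScopedRounding&) = delete;
    ScopedRounding& operator=(const ScopedRounding&) = delete;
private:
    int previous_;
};

// Hypothetical helper: convert an int64_t to float, rounding toward -infinity.
inline float floor_cast64(std::int64_t value) {
    ScopedRounding guard(FE_DOWNWARD);
    volatile std::int64_t in = value;             // force a run-time conversion
    volatile float out = static_cast<float>(in);  // stored before the guard restores the mode
    return out;
}
```

The guard restores the previous mode even if the function returns early, which the manual set/restore version has to handle by hand.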

Example output (same on Clang and G++):

Mode           Value (dec) Value (hex)    ResultBits   Result Value
FE_TOWARDZERO: 2147483520  0x7fffff80  => 4effffff     2147483520.000000
FE_UPWARD:     2147483520  0x7fffff80  => 4effffff     2147483520.000000
FE_TOWARDZERO: 2147483584  0x7fffffc0  => 4effffff     2147483520.000000
FE_UPWARD:     2147483584  0x7fffffc0  => 4f000000     2147483648.000000

Upvotes: 2
