Sergio0694
Sergio0694

Reputation: 4567

Find minimum float greater than a double value

I was having an issue with the CUDNN_BN_MIN_EPSILON value being used in the cudnnBatchNormalizationForwardTraining function (see the docs here), and it turns out it was because I was passing the float value 1e-5f instead of double (I'm working with float values to save memory and speed up computation), and this value once converted to float was slightly less than 1e-5, which is the actual value of that constant.

After some trial and error, I found a decent approximation I'm now using:

const float CUDNN_BN_MIN_EPSILON = 1e-5f + 5e-13f;

I'm sure there's a better way to approach problems like this, so the question is:

Given a positive double value, what is the best (as in "reliable") way to find the minimum possible float value which (on its own and if/when converted to double) is strictly greater than the initial double value?

Another way to formulate this problem is that, given a double value d1 and a float value f1, d1 - (float)f1 should be the minimum possible negative value (as otherwise it'd mean that f1 was less than d1, which is not what we're looking for).

I did some basic trial and error (using 1e-5 as my target value):

// Check the initial difference
> 1e-5 - 1e-5f
2,5262124918247909E-13 // We'd like a small negative value here

// Try to add the difference to the float value
> 1e-5 - (1e-5f + (float)(1e-5 - 1e-5f))
2,5262124918247909E-13 // Same, probably due to approximation

// Double the difference (as a test)
> 1e-5 - (1e-5f + (float)((1e-5 - 1e-5f) * 2))
-6,5687345259044915E-13 // OK

With this approximation, the final float value is 1,00000007E-05, which looks fine.

But, that * 2 multiplication was completely arbitrary on my end, and I'm not sure it'll be reliable or the optimum possible thing to do there.

Is there a better way to achieve this?

Thanks!


EDIT: this is the (bad) solution I'm using now, will be happy to replace it with a better one!

/// <summary>
/// Returns the minimum possible upper <see cref="float"/> approximation of the given <see cref="double"/> value
/// </summary>
/// <param name="value">The value to approximate</param>
public static float ToApproximatedFloat(this double value)
    => (float)value + (float)((value - (float)value) * 2);

SOLUTION: this is the final, correct implementation (thanks to John Bollinger):

public static unsafe float ToApproximatedFloat(this double value)
{
    // Obtain the bit representation of the double value
    ulong bits = *((ulong*)&value);

    // Extract and re-bias the exponent field
    ulong exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;

    // Extract the significand bits and truncate the excess
    ulong significand = (bits >> 29) & 0x7FFFFF;

    // Assemble the result in 32-bit unsigned integer format, then add 1
    ulong converted = (((bits >> 32) & 0x80000000u)
                        | (exponent << 23)
                        | significand) + 1;

    // Reinterpret the bit pattern as a float
    return *((float*)&converted);
}

Upvotes: 2

Views: 585

Answers (2)

John Bollinger
John Bollinger

Reputation: 180113

Inasmuch as you seem interested in the representation-level details, you'll be dependent on the representations of types float and double. In practice, however, it is very likely that that comes down to the basic "binary32" and "binary64" formats of IEEE-754. These have the general form of one sign bit, several bits of biased exponent, and a bunch of bits of significand, including, for normalized values, one implicit bit of significand.

Simple case

Given a double in IEEE-754 binary64 format whose value is no less than +2-126, what you want to do is

  • obtain the bit pattern of the original double value in a form that can be directly examined and manipulated. For example, as an unsigned 64-bit integer.

    double d = 1e-5;
    uint64_t bits;
    memcpy(&bits, &d, 8);
    
  • extract and re-bias the exponent field

    uint64_t exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
    
  • extract the significand bits and truncate the excess

    uint64_t significand = (bits >> 29) & 0x7fffff;
    
  • assemble the result in 32-bit unsigned integer format

    uint32_t float_bits = ((bits >> 32) & 0x80000000u)
            | (exponent << 23)
            | significand;
    
  • add one. Since you want a result strictly greater than the original double, this is correct regardless of whether all of the truncated significand bits were 0. It will correctly increment the exponent field if the addition overflows the significand bits. It may, however, produce the bit pattern of an infinity.

    float_bits += 1;
    
  • store / copy / reinterpret the bit pattern as that of a float

    float f;
    
    memcpy(&f, &float_bits, 4);
    

Negative numbers

Given a negative double in binary64 format whose magnitude is no less than 2-126, follow the above procedure except subtract 1 from float_bits instead of adding one. Note that for exactly -2-126, this produces a subnormal binary32 (see below), which is the correct result.

Zeroes and very small numbers, including subnormals

IEEE 754 provides reduced-precision representations of non-zero numbers of very small magnitude. Such representations are called subnormal. Under some circumstances the minimum binary32 exceeding a given input binary64 is a subnormal, including for some inputs that are not binary64 subnormals.

Also, IEEE 754 provides signed zeroes, and -0 is a special case: the minimum binary32 strictly greater than -0 (either format) is the smallest positive subnormal number. Note: not +0, because according to IEEE 754, +0 and -0 compare equal via the normal comparison operators. The minimum positive, nonzero, subnormal binary32 value has bit pattern 0x00000001.

The binary64 values subject to these considerations have biased binary64 exponent fields with values less than or equal to the difference between the binary64 exponent bias and the binary32 exponent bias (896). This includes those with biased exponents of exactly 0, which characterize binary64 zeroes and subnormals. Examination of the rebiasing step in the simple-case procedure should lead you to conclude, correctly, that that procedure will produce the wrong result for such inputs.

Code for these cases is left as an exercise.

Infinities and NaNs

Inputs with all bits of the biased binary64 exponent field set represent either positive or negative infinity (when the binary64 significand has no bits set) or a not-a-number (NaN) value. Binary64 NaNs and positive infinity should convert to their binary32 equivalents. Negative infinity should perhaps convert to the negative binary32 value of greatest magnitude. These need to be handled as special cases.

Code for these cases is left as an exercise.

Upvotes: 0

Eric Postpischil
Eric Postpischil

Reputation: 222362

In C:

#include <math.h>

float NextFloatGreaterThan(double x)
{
    float y = x;
    if (y <= x) y = nexttowardf(y, INFINITY);
    return y;
}

If you do not want to use library routines, then replace nexttowardf(y, INFINITY) above with -NextBefore(-y), where NextBefore is taken from this answer and modified:

  • Change double to float and DBL_ to FLT_.
  • Change .625 to .625f.
  • Replace fmax(SmallestPositive, fabs(q)*Scale) with SmallestPositive < fabs(q)*Scale ? fabs(q)*Scale : SmallestPositive.
  • Replace fabs(q) with (q < 0 ? -q : q).

(Obviously, the routine could be converted from -NextBefore(-y) to NextAfter(y). That is left as an exercise for the reader.)

Upvotes: 4

Related Questions