Reputation: 4567
I was having an issue with the CUDNN_BN_MIN_EPSILON value being used in the cudnnBatchNormalizationForwardTraining
function (see the docs here), and it turns out it was because I was passing the float
value 1e-5f
instead of double (I'm working with float
values to save memory and speed up computation), and this value once converted to float was slightly less than 1e-5
, which is the actual value of that constant.
After some trial and error, I found a decent approximation I'm now using:
const float CUDNN_BN_MIN_EPSILON = 1e-5f + 5e-13f;
I'm sure there's a better way to approach problems like this, so the question is:
Given a positive
double
value, what is the best (as in "reliable") way to find the minimum possiblefloat
value which (on its own and if/when converted todouble
) is strictly greater than the initialdouble
value?
Another way to formulate this problem is that, given a double
value d1
and a float
value f1
, d1 - (float)f1
should be the minimum possible negative value (as otherwise it'd mean that f1
was less than d1, which is not what we're looking for).
I did some basic trial and error (using 1e-5
as my target value):
// Check the initial difference
> 1e-5 - 1e-5f
2,5262124918247909E-13 // We'd like a small negative value here
// Try to add the difference to the float value
> 1e-5 - (1e-5f + (float)(1e-5 - 1e-5f))
2,5262124918247909E-13 // Same, probably due to approximation
// Double the difference (as a test)
> 1e-5 - (1e-5f + (float)((1e-5 - 1e-5f) * 2))
-6,5687345259044915E-13 // OK
With this approximation, the final float
value is 1,00000007E-05
, which looks fine.
But, that * 2
multiplication was completely arbitrary on my end, and I'm not sure it'll be reliable or the optimum possible thing to do there.
Is there a better way to achieve this?
Thanks!
EDIT: this is the (bad) solution I'm using now, will be happy to replace it with a better one!
/// <summary>
/// Returns the minimum possible upper <see cref="float"/> approximation of the given <see cref="double"/> value
/// </summary>
/// <param name="value">The value to approximate</param>
public static float ToApproximatedFloat(this double value)
=> (float)value + (float)((value - (float)value) * 2);
SOLUTION: this is the final, correct implementation (thanks to John Bollinger):
public static unsafe float ToApproximatedFloat(this double value)
{
// Obtain the bit representation of the double value
ulong bits = *((ulong*)&value);
// Extract and re-bias the exponent field
ulong exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
// Extract the significand bits and truncate the excess
ulong significand = (bits >> 29) & 0x7FFFFF;
// Assemble the result in 32-bit unsigned integer format, then add 1
ulong converted = (((bits >> 32) & 0x80000000u)
| (exponent << 23)
| significand) + 1;
// Reinterpret the bit pattern as a float
return *((float*)&converted);
}
Upvotes: 2
Views: 585
Reputation: 180113
Inasmuch as you seem interested in the representation-level details, you'll be dependent on the representations of types float
and double
. In practice, however, it is very likely that that comes down to the basic "binary32" and "binary64" formats of IEEE-754. These have the general form of one sign bit, several bits of biased exponent, and a bunch of bits of significand, including, for normalized values, one implicit bit of significand.
Given a double
in IEEE-754 binary64 format whose value is no less than +2-126, what you want to do is
obtain the bit pattern of the original double
value in a form that can be directly examined and manipulated. For example, as an unsigned 64-bit integer.
double d = 1e-5;
uint64_t bits;
memcpy(&bits, &d, 8);
extract and re-bias the exponent field
uint64_t exponent = ((bits >> 52) & 0x7FF) - 1023 + 127;
extract the significand bits and truncate the excess
uint64_t significand = (bits >> 29) & 0x7fffff;
assemble the result in 32-bit unsigned integer format
uint32_t float_bits = ((bits >> 32) & 0x80000000u)
| (exponent << 23)
| significand;
add one. Since you want a result strictly greater than the original double
, this is correct regardless of whether all of the truncated significand bits were 0. It will correctly increment the exponent field if the addition overflows the significand bits. It may, however, produce the bit pattern of an infinity.
float_bits += 1;
store / copy / reinterpret the bit pattern as that of a float
float f;
memcpy(&f, &float_bits, 4);
Given a negative double
in binary64 format whose magnitude is no less than 2-126, follow the above procedure except subtract 1 from float_bits
instead of adding one. Note that for exactly -2-126, this produces a subnormal binary32 (see below), which is the correct result.
IEEE 754 provides reduced-precision representations of non-zero numbers of very small magnitude. Such representations are called subnormal. Under some circumstances the minimum binary32 exceeding a given input binary64 is a subnormal, including for some inputs that are not binary64 subnormals.
Also, IEEE 754 provides signed zeroes, and -0 is a special case: the minimum binary32 strictly greater than -0 (either format) is the smallest positive subnormal number. Note: not +0, because according to IEEE 754, +0 and -0 compare equal via the normal comparison operators. The minimum positive, nonzero, subnormal binary32 value has bit pattern 0x00000001.
The binary64 values subject to these considerations have biased binary64 exponent fields with values less than or equal to the difference between the binary64 exponent bias and the binary32 exponent bias (896). This includes those with biased exponents of exactly 0, which characterize binary64 zeroes and subnormals. Examination of the rebiasing step in the simple-case procedure should lead you to conclude, correctly, that that procedure will produce the wrong result for such inputs.
Code for these cases is left as an exercise.
Inputs with all bits of the biased binary64 exponent field set represent either positive or negative infinity (when the binary64 significand has no bits set) or a not-a-number (NaN) value. Binary64 NaNs and positive infinity should convert to their binary32 equivalents. Negative infinity should perhaps convert to the negative binary32 value of greatest magnitude. These need to be handled as special cases.
Code for these cases is left as an exercise.
Upvotes: 0
Reputation: 222362
In C:
#include <math.h>
float NextFloatGreaterThan(double x)
{
float y = x;
if (y <= x) y = nexttowardf(y, INFINITY);
return y;
}
If you do not want to use library routines, then replace nexttowardf(y, INFINITY)
above with -NextBefore(-y)
, where NextBefore
is taken from this answer and modified:
double
to float
and DBL_
to FLT_
..625
to .625f
.fmax(SmallestPositive, fabs(q)*Scale)
with SmallestPositive < fabs(q)*Scale ? fabs(q)*Scale : SmallestPositive
.fabs(q)
with (q < 0 ? -q : q)
.(Obviously, the routine could be converted from -NextBefore(-y)
to NextAfter(y)
. That is left as an exercise for the reader.)
Upvotes: 4