Geoffrey Irving

Reputation: 6613

Fast, portable conversion of 24 bit ints to float without losing a bit

When generating random uniform floats in [0, 1), TensorFlow currently uses bit twiddling to convert 23 bit integers to floats in [1, 2), then subtracts one:

// Helper function to convert a 32-bit integer to a float in [0, 1).
PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  // IEEE754 floats are formatted as follows (MSB first):
  //    sign(1) exponent(8) mantissa(23)
  // Conceptually construct the following:
  //    sign == 0
  //    exponent == 127  -- an excess 127 representation of a zero exponent
  //    mantissa == 23 random bits
  const uint32 man = x & 0x7fffffu;  // 23 bit mantissa
  const uint32 exp = static_cast<uint32>(127);
  const uint32 val = (exp << 23) | man;

  // Assumes that endian-ness is same for float and uint32.
  float result;
  memcpy(&result, &val, sizeof(val));
  return result - 1.0f;
}

This irritates me, since subtracting one means we only get 23 bits of precision, rather than the 24 bits available. Unfortunately, the naive algorithm is about 9% slower on CPUs (it's the same speed on GPU):

// Helper function to convert a 32-bit integer to a float in [0, 1).
PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  return 0x1p-32f * static_cast<float>(x);
}

I also tried explicitly truncating to 24 bits in case it would teach the compiler that rounding mode flags didn't matter; this didn't solve the problem:

PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  return 0x1p-24f * static_cast<float>(x & ((1 << 24) - 1));
}

Is there a way to get the full 24 bits of available precision without sacrificing performance? I'm pretty sure I could do it in assembly, but portability is required.

Note that the up to 8 extra bits of precision one can sometimes get for small floats are not interesting: I only care about the one missing bit.

Upvotes: 3

Views: 969

Answers (2)

dga

Reputation: 21927

You can use __builtin_clz to set the exponent directly and map the remaining bits into the mantissa, avoiding the floating point subtraction and the loss of precision:

float Uint32ToFloat(uint32_t x) {
  // IEEE754 floats are formatted as follows (MSB first):
  //    sign(1) exponent(8) mantissa(23)
  // Conceptually construct the following:
  //    sign == 0
  //    exponent == 126  -- an excess 127 representation of a -1 exponent
  //    mantissa == 23 random bits
  if (x == 0) return 0.0f;  // __builtin_clz(0) is undefined
  uint32_t exp = static_cast<uint32_t>(126);

  auto lz = __builtin_clz(x);
  exp -= lz;
  x <<= lz;  // Normalize so the leading 1 is in the top bit.
  x <<= 1;   // Chop off the implicit 1 in the FP representation; done as a
             // second shift because shifting a 32-bit value by lz+1 == 32
             // would be undefined behavior.
  const uint32_t man = x >> 9;  // 23 bit mantissa.
  const uint32_t val = (exp << 23) | man;

  // Assumes that endian-ness is same for float and uint32.
  float result;
  memcpy(&result, &val, sizeof(val));
  return result;
}

Note that the CUDA equivalent of gcc's __builtin_clz is __clz().

Advantages: Preserves as much precision as possible from the original random number.

Drawback: I think the original version vectorizes better and has a bit less instruction latency.

A third alternative is to directly adjust the exponent after letting the FP hardware do the conversion from the integer:

inline float Uint32ToFloat_bit(uint32_t x) {
  float f(x);
  uint32_t f_as_int;
  memcpy(&f_as_int, &f, sizeof(f_as_int));
  f_as_int -= (32 << 23);  // Subtract 32 from the exponent.
  float result;
  memcpy(&result, &f_as_int, sizeof(f_as_int));
  return result;
}

This was faster than the builtin_clz version for me, but slower than your basic one; I again suspect it's very context-dependent. This one might vectorize quite well, but so does your very basic one, since it's just a vmulss.

After having gone through all of this, I think that the best step would be a timing harness that produces a batch of 8 random numbers and then bulk converts them, to let the compiler vectorize the conversions, and then see which is best.

Upvotes: 3

Pascal Cuoq

Reputation: 80325

You can try not doing the subtraction when the 24th bit is set:

  …
  const uint32 exp = static_cast<uint32>(126); // 0.5
  …
  if ((x & 0x800000) == 0) result -= 0.5f;
  return result;
}

However, just a 9% penalty for the 24th bit is already pretty good, and this will not necessarily be faster. (Here you sometimes avoid the price of a subtraction, but always pay the price of a test and conditional branch. I will let you do the timings: the 0x800000 mask can be computed in parallel with the rest, but the cost of the conditional branch depends entirely on the distribution of the values in practice.)

This can easily be made branchless for GPUs by always doing the subtraction followed by a conditional move, but the compiler should do this automatically.

Upvotes: 3
