Reputation: 1856
What is the difference between tf.float16 and tf.bfloat16 as listed in https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types ?
Also, what do they mean by "quantized integer"?
Upvotes: 13
Views: 12044
Reputation: 1261
Here is a picture describing the internals of the three floating-point formats:
For more information see BFloat16: The secret to high performance on Cloud TPUs
Upvotes: 1
Reputation: 24651
bfloat16 is a tensorflow-specific format that is different from IEEE's own float16, hence the new name. The b stands for (Google) Brain.
Basically, bfloat16 is a float32 truncated to its first 16 bits. So it has the same 8 bits for exponent, and only 7 bits for mantissa. It is therefore easy to convert from and to float32, and because it has basically the same range as float32, it minimizes the risks of having NaNs or exploding/vanishing gradients when switching from float32.
From the sources:
// Compact 16-bit encoding of floating point numbers. This representation uses
// 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. It
// is assumed that floats are in IEEE 754 format so the representation is just
// bits 16-31 of a single precision float.
//
// NOTE: The IEEE floating point standard defines a float16 format that
// is different than this format (it has fewer bits of exponent and more
// bits of mantissa). We don't use that format here because conversion
// to/from 32-bit floats is more complex for that format, and the
// conversion for this format is very simple.
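The "bits 16-31 of a single precision float" trick can be sketched in a few lines of Python using only the standard struct module (a minimal illustration; the function names are my own, and real hardware conversions may round to nearest rather than truncate):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Keep only the top 16 bits of a float32: sign, 8-bit exponent, 7-bit mantissa."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16: int) -> float:
    """Reinterpret the 16 bits as the high half of a float32 (low mantissa bits zeroed)."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# 1.0 survives the round trip exactly; pi loses its low 16 mantissa bits.
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(1.0)))         # 1.0
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159265)))  # 3.140625
```

Because the exponent field is untouched, the representable range is the same as float32's; only the precision drops.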
As for quantized integers, they are designed to replace floating points in trained networks to speed up processing. Basically, they are a sort of fixed point encoding of real numbers, albeit with an operating range that is chosen to represent the observed distribution at any given point of the net.
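As an illustration of that fixed-point idea, here is a hypothetical affine mapping of an observed float range onto uint8 using NumPy (TensorFlow's actual quantized ops differ in details such as rounding and zero-point handling):

```python
import numpy as np

def quantize_uint8(x, lo, hi):
    # Map the observed operating range [lo, hi] linearly onto the integers 0..255.
    scale = (hi - lo) / 255.0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize_uint8(q, scale, lo):
    # Recover an approximation of the original floats from the stored integers.
    return q.astype(np.float32) * scale + lo

x = np.array([-1.0, -0.25, 0.0, 0.7, 1.0], dtype=np.float32)
q, scale = quantize_uint8(x, lo=-1.0, hi=1.0)
x_hat = dequantize_uint8(q, scale, lo=-1.0)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

The range [lo, hi] plays the role of the "observed distribution at any given point of the net": the tighter it fits the actual activations, the smaller the quantization error.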
More on quantization here.
Upvotes: 27