Reputation: 1856
What is the difference between tf.float16 and tf.bfloat16 as listed in https://www.tensorflow.org/versions/r0.12/api_docs/python/framework/tensor_types ?
Also, what do they mean by "quantized integer"?
Upvotes: 13
Views: 12044
Reputation: 1261
Here is a picture describing the internals of the three floating-point formats:
For more information see BFloat16: The secret to high performance on Cloud TPUs
Upvotes: 1
Reputation: 24651
bfloat16 is a tensorflow-specific format that is different from IEEE's own float16, hence the new name. The b stands for (Google) Brain.
Basically, bfloat16 is a float32 truncated to its first 16 bits. So it has the same 8 bits for exponent, and only 7 bits for mantissa. It is therefore easy to convert from and to float32, and because it has basically the same range as float32, it minimizes the risks of having NaNs or exploding/vanishing gradients when switching from float32.
From the sources:
// Compact 16-bit encoding of floating point numbers. This representation uses
// 1 bit for the sign, 8 bits for the exponent and 7 bits for the mantissa. It
// is assumed that floats are in IEEE 754 format so the representation is just
// bits 16-31 of a single precision float.
//
// NOTE: The IEEE floating point standard defines a float16 format that
// is different than this format (it has fewer bits of exponent and more
// bits of mantissa). We don't use that format here because conversion
// to/from 32-bit floats is more complex for that format, and the
// conversion for this format is very simple.
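The "bits 16-31 of a single precision float" trick can be sketched in a few lines of Python using only the standard struct module (a minimal illustration; the function names are my own, and real hardware conversions may round to nearest rather than truncate):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Keep only the top 16 bits of a float32: sign, 8-bit exponent, 7-bit mantissa."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16: int) -> float:
    """Reinterpret the 16 bits as the high half of a float32 (low mantissa bits zeroed)."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# 1.0 survives the round trip exactly; pi loses its low 16 mantissa bits.
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(1.0)))         # 1.0
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159265)))  # 3.140625
```

Because the exponent field is untouched, the representable range is the same as float32's; only the precision drops.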
As for quantized integers, they are designed to replace floating points in trained networks to speed up processing. Basically, they are a sort of fixed point encoding of real numbers, albeit with an operating range that is chosen to represent the observed distribution at any given point of the net.
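As an illustration of that fixed-point idea, here is a hypothetical affine mapping of an observed float range onto uint8 using NumPy (TensorFlow's actual quantized ops differ in details such as rounding and zero-point handling):

```python
import numpy as np

def quantize_uint8(x, lo, hi):
    # Map the observed operating range [lo, hi] linearly onto the integers 0..255.
    scale = (hi - lo) / 255.0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize_uint8(q, scale, lo):
    # Recover an approximation of the original floats from the stored integers.
    return q.astype(np.float32) * scale + lo

x = np.array([-1.0, -0.25, 0.0, 0.7, 1.0], dtype=np.float32)
q, scale = quantize_uint8(x, lo=-1.0, hi=1.0)
x_hat = dequantize_uint8(q, scale, lo=-1.0)
# Round-trip error is bounded by half a quantization step (scale / 2).
```

The range [lo, hi] plays the role of the "observed distribution at any given point of the net": the tighter it fits the actual activations, the smaller the quantization error.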
More on quantization here.
Upvotes: 27