Reputation: 198
I'm currently working with TensorFlow Lite and I'm trying to understand the difference between dynamic range quantization (DRQ) and full-integer quantization (FIQ). I understand that in the first one (DRQ) only the weights are quantized, and in the second one (FIQ), both the weights and activations (outputs) are quantized.
However, I'm not sure I fully understand what this means. Regarding the quantization of the weights: are they simply cast from float32 to int8, or is another kind of operation performed? Also, why is a representative dataset needed to quantize the activations in FIQ?
I'm also wondering: if, for example, a layer of the neural network has a sigmoid activation, does this mean that in FIQ all the outputs of this layer will be 'mapped' to either 0 or 1 (rather than taking any of the possible values in between)?
Finally, in the DRQ section you can read this sentence: "At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels". If the weights are converted from 8-bit back to floating-point precision at inference, what advantage (besides the smaller size of the network) do we get from dynamic-range quantizing a model, compared to a TensorFlow Lite model with no quantization at all? Wouldn't the model be faster if this conversion wasn't done (i.e., if it operated at integer precision)?
Upvotes: 1
Views: 2755
Reputation: 632
Full-integer quantization requires the representative dataset to determine the min/max values of the inputs and activations. These ranges are needed to properly place the quantization nodes when the converter quantizes the model. In TF 1.x it was possible to inject the fake-quant nodes into the model by hand, and it seems the fake-quant nodes are still present in current versions of TensorFlow: Tensorflow documentation. The documentation page also answers your question about what kind of operation is performed when quantizing the weights (an affine mapping with a scale and zero point, not a plain cast).
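For reference, here is a minimal sketch of how the representative dataset is wired into the converter for full-integer quantization. It assumes a Keras `model` and a NumPy array `calibration_images` (both hypothetical placeholders) already exist; adapt the shapes to your own input.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # The converter runs the float model on these samples to record the
    # min/max ranges of every activation, which fix the int8 scales.
    for sample in calibration_images[:100]:
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Require int8 kernels for all ops so weights *and* activations are integer.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()
```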
As for the speed question, the same DRQ section you linked also mentions: "This conversion is done once and cached to reduce latency".
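To make the weight operation concrete, below is a rough NumPy illustration of symmetric int8 quantization (a scale-based mapping, not a plain cast) and the dequantization step that, per the quoted sentence, DRQ performs once and caches. This is only an illustrative sketch, not TensorFlow's internal implementation; TFLite's exact scheme (e.g. per-axis scales, zero points) may differ.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    # Choose a scale so the largest |weight| maps to 127, then round.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Approximate float32 recovery; in DRQ this happens once and is cached.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
print("max reconstruction error:", np.max(np.abs(w - dequantize(q, scale))))
```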
Upvotes: 2