Reputation: 198
I'm currently working with TensorFlow Lite and I'm trying to understand the difference between dynamic range quantization (DRQ) and full-integer quantization (FIQ). I understand that in the first one (DRQ) only the weights are quantized, and in the second one (FIQ), both the weights and activations (outputs) are quantized.
However, I'm not sure I fully understand what this means. Regarding the quantization of the weights: are they simply cast from float32 to int8, or is another kind of operation performed? Also, why is a representative dataset needed to quantize the activations in FIQ?
I'm also wondering: if, for example, a layer of the neural network has a sigmoid activation, does this mean that in FIQ all the outputs of this layer will be 'mapped' to either 0 or 1 (rather than taking any of the possible values in between)?
Finally, in the DRQ section you can read this sentence: "At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels". If the weights are converted from 8-bit back to floating-point precision at inference, what advantage (besides the smaller size of the network) do we get from dynamic-range quantizing a model, compared to a TensorFlow Lite model with no quantization at all? Wouldn't the model be faster if this conversion wasn't done (i.e., if it operated at integer precision)?
Upvotes: 1
Views: 2755
Reputation: 632
Full-integer quantization requires the representative dataset to determine the min/max values of the inputs and activations. These ranges are needed to properly place the quantization nodes when the converter quantizes the model. In TF 1.x it was possible to inject the fake-quant nodes into the model by hand, and it seems the fake-quant nodes are still present in current versions of TensorFlow: Tensorflow documentation. The documentation page also answers your question about what kind of operation is performed when quantizing the weights (an affine mapping with a scale and zero point, not a plain cast).
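For reference, here is a minimal sketch of how the representative dataset is wired into the converter for full-integer quantization. It assumes a Keras `model` and a NumPy array `calibration_images` (both hypothetical placeholders) already exist; adapt the shapes to your own input.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # The converter runs the float model on these samples to record the
    # min/max ranges of every activation, which fix the int8 scales.
    for sample in calibration_images[:100]:
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Require int8 kernels for all ops so weights *and* activations are integer.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_quant_model = converter.convert()
```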
As for the speed question, the same DRQ section you linked also mentions: "This conversion is done once and cached to reduce latency".
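To make the weight operation concrete, below is a rough NumPy illustration of symmetric int8 quantization (a scale-based mapping, not a plain cast) and the dequantization step that, per the quoted sentence, DRQ performs once and caches. This is only an illustrative sketch, not TensorFlow's internal implementation; TFLite's exact scheme (e.g. per-axis scales, zero points) may differ.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    # Choose a scale so the largest |weight| maps to 127, then round.
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Approximate float32 recovery; in DRQ this happens once and is cached.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric_int8(w)
print("max reconstruction error:", np.max(np.abs(w - dequantize(q, scale))))
```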
Upvotes: 2