C. Kim

Reputation: 41

Tensorflow Lite inference - how do I scale down the convolution layer outputs?

I built a simple CNN model with one convolutional layer (for MNIST) and converted it with TensorFlow Lite. So now my model takes 8-bit integer inputs and its weights are 8-bit integers too.

I wanted to test the parameters I got from TFLite, so I wrote C code for the inference step.

Input image pixels were 8-bit integers between 0 and 255, and the weights were 8-bit integers between -128 and 127. (Biases were 32-bit integers.) The convolution results, of course, contained numbers bigger than 255.

I checked this paper (https://arxiv.org/pdf/1712.05877.pdf, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference") and it had some tips on what to do with this convolution result. It said I had to (1) scale down, (2) cast down (to uint8), and (3) apply the activation function to generate the 8-bit output.

To my understanding, I needed to multiply the convolution results by 2^(-n). So I divided the convolution outputs by 256, clamped the maximum to 255, and then carried the results on through the fully connected layer's weights.
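In code, my re-quantization step looked roughly like this (simplified; the divisor 256 was just my guess):

    #include <stdint.h>

    /* My naive "scale down / cast down": divide the 32-bit accumulator by a
       fixed constant and clamp to the uint8 range. 256 was an arbitrary choice. */
    uint8_t requantize_naive(int32_t acc)
    {
        int32_t scaled = acc / 256;          /* "scale down" */
        if (scaled < 0)   scaled = 0;        /* lower clamp (also acts as ReLU here) */
        if (scaled > 255) scaled = 255;      /* upper clamp */
        return (uint8_t)scaled;              /* "cast down" to uint8 */
    }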

It gave a good result (accuracy 0.96+), but it was not as high as the TFLite evaluation said (accuracy 0.98+).

I don't think I did it the right way, because the 256 I divided the convolution outputs by was an arbitrary number. In fact, when I changed it to 340 I got the best result, but it was still well below the TFLite evaluation with the TFLite Interpreter.

What is the correct and sophisticated way to implement the inference step? How do I scale down?

Upvotes: 4

Views: 1972

Answers (1)

T.J. Alumbaugh

Reputation: 56

This is a great question about the fundamentals of quantization in TF Lite. The paper you mention is a great reference and a guide for understanding the underlying math. TF Lite now uses a slightly different quantization scheme from the paper above, but still supports that scheme in full for models that were converted prior to the implementation of the current scheme. For your reference, you can see details of the new quantization scheme here:

https://www.tensorflow.org/lite/performance/quantization_spec

The answer to your question applies equally well to all quantization schemes in TF Lite. As to the particulars of your question: you want to understand how to go from the 32-bit accumulator (the result of adding up all of the activation * filter products) down to the quantized value (either uint8 or int8). From the paper, you can see that the matrix multiplication (the arithmetic is similar for the convolution case you are interested in) is done with all integer operations, except for the real-valued multiplier M defined in Equation 5 in Section 2.2. The goal of the quantization scheme is to perform all of the math operations in integer-only arithmetic, so the challenge is then 'how do I multiply by the real-valued M with only integer operations?'.
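Before getting to the integer-only version, it may help to write down what this step computes in plain floating point. This is just a sketch in my own notation; the scales and zero points come from the quantization parameters stored with each tensor, and M is the multiplier from Equation 5:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Float reference for the re-quantization step. The 32-bit accumulator is
    // assumed to already include the bias. M = (input_scale * filter_scale) /
    // output_scale is the real-valued multiplier from Eq. 5 of the paper, and
    // output_zero_point belongs to the output tensor. Names here are mine.
    uint8_t RequantizeFloatReference(int32_t acc, double M, int32_t output_zero_point) {
      const double scaled = M * static_cast<double>(acc);
      int32_t result = static_cast<int32_t>(std::round(scaled)) + output_zero_point;
      result = std::max<int32_t>(0, std::min<int32_t>(255, result));  // clamp to uint8 range
      return static_cast<uint8_t>(result);
    }

The whole point of the scheme is to reproduce that multiplication by M without ever touching floating point at inference time.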

The "trick" is to represent M as is done in equation 6, as the product of 2 raised to some negative exponent times M_0, which is a real number of at least 0.5 and bound from above by 1. At first glance, that does not appear to make our problem any easier. However, first consider the 2^(-n) part. This can be represented on a computer as a just a bit shift (I'll talk about rounding in a second). Assuming any rounding issues are handled, that part is easy to do with only integer arithmetic. Now for the M_0 part. By construction, we have bound M_0 to a range where we can used a fixed point representation with an integer type (e.g. int32) and use all the bits as fractional bits (if you are not familiar with fixed point representation you might need to refer to outside sources of information).

We call the 32-bit fixed point representation of M_0 the "quantized multiplier". You can see the particulars of the operation in the link below, but, essentially, multiplying the accumulator by the quantized multiplier involves a standard integer multiplication resulting in a 64-bit number, and then taking the high 32-bits of that result.
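A scalar sketch of that operation, written in the spirit of the reference code rather than copied from it, looks like this (the one special case is when both inputs are INT32_MIN, which would overflow, so the result saturates):

    #include <cstdint>
    #include <limits>

    // Fixed-point multiply of the accumulator by the Q0.31 quantized multiplier:
    // widen to 64 bits, multiply, round, and keep the high 32 bits of the
    // doubled product.
    int32_t SaturatingRoundingDoublingHighMulSketch(int32_t a, int32_t b) {
      const bool overflow = (a == b) && (a == std::numeric_limits<int32_t>::min());
      const int64_t ab_64 = static_cast<int64_t>(a) * static_cast<int64_t>(b);
      const int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));  // round to nearest
      const int32_t result = static_cast<int32_t>((ab_64 + nudge) / (1LL << 31));
      return overflow ? std::numeric_limits<int32_t>::max() : result;
    }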

The actual code is a bit difficult to go through because there are various issues to handle: proper rounding (as discussed in the paper), overflows, value saturation, clamping, etc. You can get started on understanding it, though, by looking at the reference implementation here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L153-L162

where SaturatingRoundingDoublingHighMul implements the fixed point multiplication by the quantized multiplier and RoundingDivideByPOT implements the multiplication by 2^(-n).
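And a scalar sketch of the rounding shift, plus the two pieces combined (again, illustrative rather than the library code verbatim; it assumes M < 1 so that the shift is a right shift):

    #include <cstdint>

    int32_t SaturatingRoundingDoublingHighMulSketch(int32_t a, int32_t b);  // from the snippet above

    // Divide by 2^exponent with rounding to nearest (the RoundingDivideByPOT part).
    int32_t RoundingDivideByPOTSketch(int32_t x, int exponent) {
      const int32_t mask = static_cast<int32_t>((1LL << exponent) - 1);
      const int32_t remainder = x & mask;
      const int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
      return (x >> exponent) + (remainder > threshold ? 1 : 0);
    }

    // The full "multiply the accumulator by M" step, in integer arithmetic only:
    // acc * M ~= RoundingDivideByPOT(SaturatingRoundingDoublingHighMul(acc, M_0_fixed), n)
    int32_t MultiplyByQuantizedMultiplierSketch(int32_t acc, int32_t quantized_multiplier,
                                                int right_shift) {
      return RoundingDivideByPOTSketch(
          SaturatingRoundingDoublingHighMulSketch(acc, quantized_multiplier), right_shift);
    }

After this multiply-and-shift you add the output zero point and clamp to [0, 255] (or [-128, 127] for int8), which is exactly the "scale down, cast down, apply activation" recipe from the paper. In other words, the divisor is not a free parameter like your 256; it is determined by the input, filter, and output scales.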

In actual code run on devices, TF Lite uses various kinds of optimized instructions to implement this arithmetic, but the reference code gets the same answer and is easier to inspect and understand. Hope that helps!

Upvotes: 4
