Reputation: 714
For research purposes, I'm trying to understand how TF Lite does its inference. I'm interested only in the software logic.
I'm using TensorFlow 2.1 and TensorFlow Model Optimization 0.3.0.
As an example, I use a very simple fully connected network:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation=None)
])
I train the network on MNIST with quantization-aware training.
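A minimal sketch of the QAT step, assuming the tfmot quantize_model API (the optimizer, loss, and epoch count are illustrative, not my actual settings):

import tensorflow_model_optimization as tfmot

# Wrap the float model with fake-quant ops, then train as usual
model = tfmot.quantization.keras.quantize_model(model)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(ds_train, epochs=5)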
Then I quantize the network with TF Lite:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = data_generator(ds_train)
quantized_tflite_model = converter.convert()
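The converter expects representative_dataset to be a callable that yields lists of float32 input samples for calibration; a sketch of what data_generator could look like (the sample count and preprocessing are assumptions):

def data_generator(ds):
    def gen():
        for image, _ in ds.batch(1).take(100):  # a few hundred samples is typical
            yield [tf.cast(image, tf.float32)]
    return gen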
In order to make sure that I know what I'm doing, I did 3 things: (1) I used TF to get outputs from the 32-bit model; (2) I used TF Lite to get outputs from the quantized model; (3) I implemented the forward pass of the 32-bit model in Python and compared its outputs to the previous two.
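For reference, a sketch of that float32 forward pass (assuming image is a float (28, 28, 1) array and model is the trained float model):

w, b = model.layers[1].get_weights()          # Dense kernel (784, 10) and bias (10,)
float_logits = image.reshape(1, -1) @ w + b   # matches model(image) up to float error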
Now I'm trying to understand how to implement the forward pass of the quantized model.
Using interpreter.get_tensor_details(), I get the following output:
{'name': 'Identity', 'index': 0, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
{'name': 'flatten_input_int8', 'index': 1, 'shape': array([ 1, 28, 28, 1]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.003921568859368563, -128)}
{'name': 'sequential/quant_dense/BiasAdd', 'index': 2, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.22868551313877106, 49)}
{'name': 'sequential/quant_dense/LastValueQuant/FakeQuantWithMinMaxVars/transpose', 'index': 3, 'shape': array([ 10, 784]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.01087072491645813, 0)}
{'name': 'sequential/quant_dense/MatMul_bias', 'index': 4, 'shape': array([10]), 'dtype': <class 'numpy.int32'>, 'quantization': (4.263029768480919e-05, 0)}
{'name': 'sequential/quant_dense/BiasAdd_float', 'index': 5, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
{'name': 'flatten_input', 'index': 6, 'shape': array([ 1, 28, 28, 1]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
I'm using this paper as a reference: https://arxiv.org/pdf/1712.05877.pdf
I also read this page: https://www.tensorflow.org/lite/performance/quantization_spec
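For reference, with real values r = S * (q - Z), the paper's formula for a dense layer is

q_out[i] = Z_out + M * ( sum_j (q_in[j] - Z_in) * (q_w[i, j] - Z_w) + q_bias[i] ),  where M = (S_in * S_w) / S_out

the bias is quantized with scale S_in * S_w and zero point 0, and M is applied as a fixed-point multiplier followed by a right shift.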
My current implementation goes like this:
import numpy as np

tensor_details = tflite_model.interpreter.get_tensor_details()

def quantization_params(index):
    return tensor_details[index]['quantization'][0], tensor_details[index]['quantization'][1]
image = get_single_test_image(show_image=False)
# #### Convert input image from float32 to int8 ####
q_scale, q_zero = quantization_params(index=1)
x = np.round(image / q_scale + q_zero).astype(np.int8)  # q = round(r / S + Z)
# #### Flatten input ####
x = x.flatten()
# #### Dense layer ####
kernel, bias = tflite_model.interpreter.get_tensor(3), tflite_model.interpreter.get_tensor(4)
s_input, z_input = quantization_params(index=1)
s_kernel, z_kernel = quantization_params(index=3)
s_output, z_output = quantization_params(index=4)
M = s_input * s_kernel
quantized_multiplier, right_shift = quantize_multiplier_smaller_than_one(M)
dense_output = np.zeros((kernel.shape[0],), dtype=np.int32)
for i in range(dense_output.shape[0]):
    for j in range(kernel.shape[1]):
        # real = S * (q - Z), so the zero points are subtracted
        dense_output[i] += (int(x[j]) - z_input) * (int(kernel[i, j]) - z_kernel)
x = dense_output + bias
x = np.right_shift(x * quantized_multiplier, right_shift)
The function quantize_multiplier_smaller_than_one is my Python implementation of the C++ function here: https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc
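For reference, a port along these lines (matching the gemmlowp example, minus the int32 saturation asserts):

def quantize_multiplier_smaller_than_one(real_multiplier):
    # Decompose M in (0, 1) as M = quantized_multiplier * 2^-31 * 2^-right_shift,
    # with quantized_multiplier a Q31 value in [2^30, 2^31)
    assert 0.0 < real_multiplier < 1.0
    right_shift = 0
    while real_multiplier < 0.5:
        real_multiplier *= 2.0
        right_shift += 1
    quantized_multiplier = int(round(real_multiplier * (1 << 31)))
    if quantized_multiplier == (1 << 31):  # rounding can push it to exactly 2^31
        quantized_multiplier //= 2
        right_shift -= 1
    return quantized_multiplier, right_shift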
So my questions here are: is this the correct approach? I'm definitely missing some calculation here; what is it? And when I have a bigger network, how do I know how to systematically use the correct indexes to pull the quantization params for each layer?
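By "systematically" I mean something like looking tensors up by name instead of by hard-coded index (an untested sketch):

details = {d['name']: d for d in tflite_model.interpreter.get_tensor_details()}
s_kernel, z_kernel = details['sequential/quant_dense/LastValueQuant/FakeQuantWithMinMaxVars/transpose']['quantization']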
Many thanks for any advice.
Upvotes: 2
Views: 439
Reputation: 714
At last, I solved this issue by digging into the TensorFlow Lite code. I found the relevant code and modified it so that it printed all the relevant info I needed into text files. From there I could parse everything in Python and run a Pythonic version of the C++ logic.
In case someone wants to try to do the same: to build the C++ code, follow the TensorFlow build-from-source guide.
The entry point of a sample app is here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/minimal
And for example, the convolution reference code is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h
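To give a flavor of what the "Pythonic version" looks like for the fully connected case from the question, here is a condensed sketch (the reference code does the rescaling with SaturatingRoundingDoublingHighMul and RoundingDivideByPOT; the rounding below only approximates that):

import numpy as np

def quantized_dense(x_q, w_q, bias_q, z_in, z_w, z_out, q_mult, shift):
    # int32 accumulation of zero-point-corrected int8 products
    acc = (w_q.astype(np.int64) - z_w) @ (x_q.astype(np.int64) - z_in)
    acc += bias_q
    # rescale by M ~= q_mult * 2^-31 * 2^-shift (approximate rounding)
    acc = (acc * q_mult + (1 << 30)) >> 31
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift
    # add the output zero point and clamp to the int8 range
    return np.clip(acc + z_out, -128, 127).astype(np.int8)

Fed with the tensors and quantization params from the question (kernel at index 3, bias at index 4, output scale and zero point at index 2), this should reproduce the int8 BiasAdd output up to the rounding caveat above.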
Enjoy (not really)
Upvotes: 1