Reputation: 714
For research purposes, I'm trying to understand how TF Lite does its inference. I'm interested only in the software logic.
I'm using TensorFlow 2.1 and TensorFlow Model Optimization 0.3.0.
As an example, I use a very simple fully connected network:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation=None)
])
I train the network on MNIST with quantization-aware training.
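A minimal sketch of the QAT step, assuming the tfmot quantize_model API (the optimizer, loss, and epoch count are illustrative, not my actual settings):

import tensorflow_model_optimization as tfmot

# Wrap the float model with fake-quant ops, then train as usual
model = tfmot.quantization.keras.quantize_model(model)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(ds_train, epochs=5)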
Then I quantize the network with TF Lite:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = data_generator(ds_train)
quantized_tflite_model = converter.convert()
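The converter expects representative_dataset to be a callable that yields lists of float32 input samples for calibration; a sketch of what data_generator could look like (the sample count and preprocessing are assumptions):

def data_generator(ds):
    def gen():
        for image, _ in ds.batch(1).take(100):  # a few hundred samples is typical
            yield [tf.cast(image, tf.float32)]
    return gen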
In order to make sure that I know what I'm doing, I did 3 things: (1) I used TF to get outputs from the 32-bit model; (2) I used TF Lite to get outputs from the quantized model; (3) I implemented the forward pass of the 32-bit model in Python and compared its outputs to the previous two.
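For reference, a sketch of that float32 forward pass (assuming image is a float (28, 28, 1) array and model is the trained float model):

w, b = model.layers[1].get_weights()          # Dense kernel (784, 10) and bias (10,)
float_logits = image.reshape(1, -1) @ w + b   # matches model(image) up to float error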
Now I'm trying to understand how to implement the forward pass of the quantized model.
Using interpreter.get_tensor_details(), I get the following output:
{'name': 'Identity', 'index': 0, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
{'name': 'flatten_input_int8', 'index': 1, 'shape': array([ 1, 28, 28, 1]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.003921568859368563, -128)}
{'name': 'sequential/quant_dense/BiasAdd', 'index': 2, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.22868551313877106, 49)}
{'name': 'sequential/quant_dense/LastValueQuant/FakeQuantWithMinMaxVars/transpose', 'index': 3, 'shape': array([ 10, 784]), 'dtype': <class 'numpy.int8'>, 'quantization': (0.01087072491645813, 0)}
{'name': 'sequential/quant_dense/MatMul_bias', 'index': 4, 'shape': array([10]), 'dtype': <class 'numpy.int32'>, 'quantization': (4.263029768480919e-05, 0)}
{'name': 'sequential/quant_dense/BiasAdd_float', 'index': 5, 'shape': array([ 1, 10]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
{'name': 'flatten_input', 'index': 6, 'shape': array([ 1, 28, 28, 1]), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}
I'm using this paper as a reference: https://arxiv.org/pdf/1712.05877.pdf
I also read this page: https://www.tensorflow.org/lite/performance/quantization_spec
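For reference, with real values r = S * (q - Z), the paper's formula for a dense layer is

q_out[i] = Z_out + M * ( sum_j (q_in[j] - Z_in) * (q_w[i, j] - Z_w) + q_bias[i] ),  where M = (S_in * S_w) / S_out

the bias is quantized with scale S_in * S_w and zero point 0, and M is applied as a fixed-point multiplier followed by a right shift.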
My current implementation goes like this:
import numpy as np

tensor_details = tflite_model.interpreter.get_tensor_details()

def quantization_params(index):
    return tensor_details[index]['quantization'][0], tensor_details[index]['quantization'][1]
image = get_single_test_image(show_image=False)
# #### Convert input image from float32 to int8 ####
q_scale, q_zero = quantization_params(index=1)
x = np.round(image / q_scale + q_zero).astype(np.int8)  # q = round(r / S + Z)
# #### Flatten input ####
x = x.flatten()
# #### Dense layer ####
kernel, bias = tflite_model.interpreter.get_tensor(3), tflite_model.interpreter.get_tensor(4)
s_input, z_input = quantization_params(index=1)
s_kernel, z_kernel = quantization_params(index=3)
s_output, z_output = quantization_params(index=4)
M = s_input * s_kernel
quantized_multiplier, right_shift = quantize_multiplier_smaller_than_one(M)
dense_output = np.zeros((kernel.shape[0],), dtype=np.int32)
for i in range(dense_output.shape[0]):
    for j in range(kernel.shape[1]):
        # real = S * (q - Z), so the zero points are subtracted
        dense_output[i] += (int(x[j]) - z_input) * (int(kernel[i, j]) - z_kernel)
x = dense_output + bias
x = np.right_shift(x * quantized_multiplier, right_shift)
The function quantize_multiplier_smaller_than_one is my Python implementation of the C++ function here: https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc
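For reference, a port along these lines (matching the gemmlowp example, minus the int32 saturation asserts):

def quantize_multiplier_smaller_than_one(real_multiplier):
    # Decompose M in (0, 1) as M = quantized_multiplier * 2^-31 * 2^-right_shift,
    # with quantized_multiplier a Q31 value in [2^30, 2^31)
    assert 0.0 < real_multiplier < 1.0
    right_shift = 0
    while real_multiplier < 0.5:
        real_multiplier *= 2.0
        right_shift += 1
    quantized_multiplier = int(round(real_multiplier * (1 << 31)))
    if quantized_multiplier == (1 << 31):  # rounding can push it to exactly 2^31
        quantized_multiplier //= 2
        right_shift -= 1
    return quantized_multiplier, right_shift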
So my questions here are: is this the correct approach? I'm definitely missing some calculation here; what is it? And when I have a bigger network, how do I know how to systematically use the correct indexes to pull the quantization params for each layer?
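By "systematically" I mean something like looking tensors up by name instead of by hard-coded index (an untested sketch):

details = {d['name']: d for d in tflite_model.interpreter.get_tensor_details()}
s_kernel, z_kernel = details['sequential/quant_dense/LastValueQuant/FakeQuantWithMinMaxVars/transpose']['quantization']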
Many thanks for any advice.
Upvotes: 2
Views: 439
Reputation: 714
At last, I solved this issue by digging into the TensorFlow Lite code. I found the relevant code and modified it so that it printed all the relevant info I needed into text files. From there I could parse everything in Python and run a Pythonic version of the C++ logic.
In case someone wants to try to do the same: to build the C++ code, follow the TensorFlow build-from-source guide.
The entry point of a sample app is here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/examples/minimal
And for example, the convolution reference code is here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h
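To give a flavor of what the "Pythonic version" looks like for the fully connected case from the question, here is a condensed sketch (the reference code does the rescaling with SaturatingRoundingDoublingHighMul and RoundingDivideByPOT; the rounding below only approximates that):

import numpy as np

def quantized_dense(x_q, w_q, bias_q, z_in, z_w, z_out, q_mult, shift):
    # int32 accumulation of zero-point-corrected int8 products
    acc = (w_q.astype(np.int64) - z_w) @ (x_q.astype(np.int64) - z_in)
    acc += bias_q
    # rescale by M ~= q_mult * 2^-31 * 2^-shift (approximate rounding)
    acc = (acc * q_mult + (1 << 30)) >> 31
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift
    # add the output zero point and clamp to the int8 range
    return np.clip(acc + z_out, -128, 127).astype(np.int8)

Fed with the tensors and quantization params from the question (kernel at index 3, bias at index 4, output scale and zero point at index 2), this should reproduce the int8 BiasAdd output up to the rounding caveat above.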
Enjoy (not really)
Upvotes: 1