Ohad Meir

Reputation: 714

TensorFlow fake-quantize layers are also called from TF-Lite

I'm using TensorFlow 2.1 in order to train models with quantization-aware training.

The code to do that is:

import tensorflow_model_optimization as tfmot

# Annotate the model and then apply the quantization wrappers,
# which inserts the fake-quantize nodes into the graph.
model = tfmot.quantization.keras.quantize_annotate_model(model)
model = tfmot.quantization.keras.quantize_apply(model)

This adds fake-quantize nodes to the graph. These nodes should adjust the model's weights so that they are easier to quantize to int8 and to work with int8 data.
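
I then train the quantization-aware model as usual with Keras; roughly like this (train_images and train_labels stand in for the actual training data):

# Recompile after applying quantization, then train as usual.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=1, validation_split=0.1)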

When the training ends, I convert and quantize the model to TF-Lite like so:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = [give data provider]
quantized_tflite_model = converter.convert()

At this point, I wouldn't expect to see the fake-quantize layers in the TF-Lite graph. But surprisingly, I do see them. Moreover, when I run this quantized model in the TF-Lite C++ sample app, I see that it also runs the fake-quantize nodes during inference. On top of that, it dequantizes and quantizes the activations between each layer.

That's a sample of the output from the C++ code:

Node 0 Operator Builtin Code 80 FAKE_QUANT
Inputs: 1
Outputs: 237
Node 1 Operator Builtin Code 114 QUANTIZE
Inputs: 237
Outputs: 238
Node 2 Operator Builtin Code 3 CONV_2D
Inputs: 238 59 58
Outputs: 167
Temporaries: 378
Node 3 Operator Builtin Code 6 DEQUANTIZE
Inputs: 167
Outputs: 239
Node 4 Operator Builtin Code 80 FAKE_QUANT
Inputs: 239
Outputs: 166
Node 5 Operator Builtin Code 114 QUANTIZE
Inputs: 166
Outputs: 240
Node 6 Operator Builtin Code 3 CONV_2D
Inputs: 240 61 60
Outputs: 169

I find all of this very strange, especially considering that this model should run entirely in int8, and yet the fake-quantize nodes are receiving float32 inputs.

Any help here would be appreciated.

Upvotes: 5

Views: 2335

Answers (3)

iamaman

Reputation: 1

I have encountered the same issue. In my case, the quantized tflite model's size increases by roughly 3x when fake quantization is used. Does this happen for you as well? Inspecting the tflite graph in Netron shows that quantize layers are inserted between every pair of ops.

My workaround so far is to instantiate a new copy of the model without fake quantization, and then load the weights layer by layer from the quantization-aware-trained model. The weights can't be copied for the whole model at once, because the fake-quantization layers have parameters of their own.
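
A rough sketch of that copy, assuming the plain model has the same architecture and layer names, and that the QAT model's layers are QuantizeWrapper instances that expose the original layer as .layer:

# plain_model: a freshly built copy of the architecture without quantization (assumed).
# qat_model:   the quantization-aware-trained model.
# Index the unwrapped QAT layers by their original layer name.
qat_layers = {}
for layer in qat_model.layers:
    inner = getattr(layer, 'layer', layer)  # QuantizeWrapper keeps the original layer in .layer
    qat_layers[inner.name] = inner

for plain_layer in plain_model.layers:
    source = qat_layers.get(plain_layer.name)
    if source is None:
        continue
    src_weights = source.get_weights()
    # Only copy when the weight lists line up, which skips the extra quantizer variables.
    if len(src_weights) == len(plain_layer.get_weights()):
        plain_layer.set_weights(src_weights)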

Upvotes: 0

lennart

Reputation: 29

You can force TF Lite to use only the INT8 operations:

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

If an error occurs, then some layers of your network do not have an INT8 implementation yet.

Furthermore, you could also investigate your network using Netron.
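
Alternatively, you can list the tensors and their data types directly from Python with the standard tf.lite.Interpreter API, for example:

import tensorflow as tf

# quantized_tflite_model is the flatbuffer returned by converter.convert().
interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
interpreter.allocate_tensors()

# Print every tensor's index, name and dtype to see which ones stayed float32.
for detail in interpreter.get_tensor_details():
    print(detail['index'], detail['name'], detail['dtype'])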

Additionally, if you also want INT8 inputs and outputs, you need to set those as well:

converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

However, there is currently an open issue regarding the input and output types; see Issue #38285.
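
Putting those pieces together, the converter setup could look like this sketch (q_aware_model stands for your quantization-aware-trained Keras model):

converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Restrict the converter to ops that have an INT8 kernel.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

# Also quantize the model's input and output tensors.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

quantized_tflite_model = converter.convert()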

Upvotes: 0

Xianbao QIAN

Reputation: 123

representative_dataset is mostly used with post-training quantization.

Comparing your commands with the QAT example, you probably want to remove that line:

https://www.tensorflow.org/model_optimization/guide/quantization/training_example

import os
import tempfile

# Convert the quantization-aware-trained model (q_aware_model).
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

quantized_tflite_model = converter.convert()


# Create float TFLite model.
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()

# Measure sizes of models.
_, float_file = tempfile.mkstemp('.tflite')
_, quant_file = tempfile.mkstemp('.tflite')

with open(quant_file, 'wb') as f:
  f.write(quantized_tflite_model)

with open(float_file, 'wb') as f:
  f.write(float_tflite_model)

print("Float model in Mb:", os.path.getsize(float_file) / float(2**20))
print("Quantized model in Mb:", os.path.getsize(quant_file) / float(2**20))

Upvotes: 0
