Why would a much lighter Keras model run at the same speed at inference as the much larger original model?

Question

I trained a Keras model with the following architecture:

def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)
    # Image augmentation block
    x = inputs
    # Entry block
    x = layers.experimental.preprocessing.Rescaling(1.0 / 255)(x)
    x = layers.Conv2D(32, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    x = layers.Conv2D(64, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    previous_block_activation = x  # Set aside residual

    for size in [128, 256, 512, 728]:
        x = layers.Activation("relu")(x) 
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)

        x = layers.Activation("relu")(x) 
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)

        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

        # Project residual
        residual = layers.Conv2D(size, 1, strides=2, padding="same")(
            previous_block_activation
        )
        x = layers.add([x, residual])  # Add back residual
        previous_block_activation = x  # Set aside next residual

    x = layers.SeparableConv2D(1024, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    x = layers.GlobalAveragePooling2D()(x)
    if num_classes == 2:
        activation = "sigmoid"
        units = 1
    else:
        activation = "softmax"
        units = num_classes

    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(units, activation=activation)(x)
    return keras.Model(inputs, outputs)

And that model has over 2 million trainable parameters.

I then trained a much lighter model with only about 300,000 trainable parameters:

def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape) 
    # Image augmentation block
    x = inputs
    # Entry block
    x = layers.experimental.preprocessing.Rescaling(1.0 / 255)(x)
    x = layers.Conv2D(64, kernel_size=(7, 7), activation=tf.keras.layers.LeakyReLU(alpha=0.01), padding = "same", input_shape=image_size + (3,))(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(192, kernel_size=(3, 3), activation=tf.keras.layers.LeakyReLU(alpha=0.01), padding = "same", input_shape=image_size + (3,))(x)
    x = layers.Conv2D(128, kernel_size=(1, 1), activation=tf.keras.layers.LeakyReLU(alpha=0.01), padding = "same", input_shape=image_size + (3,))(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(128, kernel_size=(3, 3), activation=tf.keras.layers.LeakyReLU(alpha=0.01), padding = "same", input_shape=image_size + (3,))(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.5)(x)

    x = layers.GlobalAveragePooling2D()(x)
    if num_classes == 2:
        activation = "sigmoid"
        units = 1
    else:
        activation = "softmax"
        units = num_classes

    x = layers.Dropout(0.5)(x)

    outputs = layers.Dense(units, activation=activation)(x)
    
    return keras.Model(inputs, outputs)

However, the second model (which is much lighter and even accepts a smaller input size) seems to run at the same speed, classifying only 2 images per second. Shouldn't there be a difference in speed, given that it's a smaller model? Looking at the code, is there a glaring reason why that wouldn't be the case?

I'm using the same method at inference in both cases:

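import time

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing import image

image_size = (180, 180)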
batch_size = 32


model = keras.models.load_model('model_13.h5')

t_end = time.time() + 10

iters = 0

while time.time() < t_end:

    img = keras.preprocessing.image.load_img(
        "test2.jpg", target_size=image_size
    )


    img_array = image.img_to_array(img)

    #print(img_array.shape)

    img_array = tf.expand_dims(img_array, 0)  # Create batch axis


    predictions = model.predict(img_array)
    score = predictions[0]

    print(score)
    iters += 1

    if score < 0.5:
        print('Fire')
    else:
        print('No Fire')


print('TOTAL: ', iters)

Sascha Kirch · Accepted Answer

The number of parameters is at most an indication of how fast a model trains or runs inference. The actual speed depends on many other factors.

Here are some examples of factors that might influence the throughput of your model:

The activation function: ReLU activations are faster than, e.g., ELU or GELU, which have exponential terms. Computing an exponential is slower than computing a linear function, and the gradient is also more expensive to compute. For ReLU, the gradient on the active side is simply a constant, the slope of the activation (e.g. 1).
The bit precision used for your data: some hardware accelerators compute faster in float16 than in float32, and reading fewer bits also decreases latency.
Some layers have no parameters but still perform fixed calculations. Even though no parameters are added to the network's weights, a computation is still performed.
The architecture of your training hardware: certain filter sizes and batch sizes can be computed more efficiently than others.
Sometimes the speed of the computing hardware is not the bottleneck at all; the input pipeline for loading and preprocessing your data is.
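
To make the gap between parameter count and compute concrete, here is a rough sketch that counts parameters and multiply-accumulates (MACs) for the Conv2D layers of the lighter model. It assumes a 180x180 RGB input (taken from image_size in the inference script, which may not match what was actually used), stride 1, "same" padding and 2x2 pooling as in the posted code. The helper conv2d_cost and the layers_spec table are illustrative names, not part of Keras.

```python
# Rough estimate of parameters vs. multiply-accumulates (MACs) for the
# Conv2D layers of the lighter model. Assumes a 180x180 RGB input, "same"
# padding, stride 1, and 2x2 max pooling, as in the question's code.

def conv2d_cost(height, width, kernel, c_in, c_out):
    """Return (params, macs) for a stride-1, 'same'-padded Conv2D with bias."""
    params = kernel * kernel * c_in * c_out + c_out
    macs = height * width * kernel * kernel * c_in * c_out
    return params, macs

# (name, spatial size seen by the layer, kernel, in channels, out channels)
layers_spec = [
    ("conv 7x7", 180, 7, 3, 64),
    ("conv 3x3", 90, 3, 64, 192),   # after the first 2x2 pooling
    ("conv 1x1", 90, 1, 192, 128),
    ("conv 3x3", 45, 3, 128, 128),  # after the second 2x2 pooling
]

total_params = 0
total_macs = 0
for name, size, kernel, c_in, c_out in layers_spec:
    params, macs = conv2d_cost(size, size, kernel, c_in, c_out)
    total_params += params
    total_macs += macs
    print(f"{name:9s} params={params:>8,d}  MACs={macs:>13,d}")

print(f"total     params={total_params:>8,d}  MACs={total_macs:>13,d}")
```

Under these assumptions the convolutions hold roughly 0.29 million parameters but need about 1.7 billion MACs per image, because the first layers run without striding on large feature maps. The larger model, by contrast, halves the resolution in its very first convolution (strides=2), so its per-image cost is not simply proportional to its parameter count either.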

It's hard to tell without testing, but in your particular example I would guess that the following might slow down your inference:

The large receptive field of the 7x7 conv
leaky_relu is slightly slower than relu
Probably your data input pipeline is the bottleneck, not the inference speed. If the inference is much faster than the data preparation, it might appear that both models have the same speed, when in reality the hardware is idle and waiting for data.
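
One way to check the last point is to time the loading step and the prediction step separately. The following is a minimal sketch, not a drop-in replacement for your loop: time_stages, fake_load and fake_predict are hypothetical names, and the two fake functions only sleep so that the example runs without a model or an image file.

```python
import time

def time_stages(load_fn, predict_fn, n_iters=20):
    """Time data loading and prediction separately.

    load_fn:    callable returning one prepared input batch
    predict_fn: callable taking that batch and returning predictions
    Returns the mean seconds per iteration spent in each stage.
    """
    load_total = 0.0
    predict_total = 0.0
    for _ in range(n_iters):
        t0 = time.perf_counter()
        batch = load_fn()
        t1 = time.perf_counter()
        predict_fn(batch)
        t2 = time.perf_counter()
        load_total += t1 - t0
        predict_total += t2 - t1
    return {"load": load_total / n_iters, "predict": predict_total / n_iters}


# Hypothetical stand-ins so the sketch runs without a model or image file.
# With the question's code, load_fn would wrap load_img / img_to_array /
# expand_dims, and predict_fn would be model.predict.
def fake_load():
    time.sleep(0.02)   # pretend that decoding and resizing takes 20 ms
    return [0.0]

def fake_predict(batch):
    time.sleep(0.005)  # pretend that the forward pass takes 5 ms
    return [0.9]

timings = time_stages(fake_load, fake_predict, n_iters=5)
print(f"load:    {timings['load'] * 1000:.1f} ms per image")
print(f"predict: {timings['predict'] * 1000:.1f} ms per image")
```

If the predict time is nearly identical for both models, the fixed per-call overhead of model.predict on a single image may dominate rather than the network itself. The Keras documentation suggests calling the model directly, e.g. model(img_array, training=False), for small inputs that fit in one batch, so that is worth trying as well.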

To understand what's going on, you could either change some parameters and measure the speed again, or you could analyze your input pipeline by profiling your hardware with TensorBoard. Here is a small guide: https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras

Best, Sascha
