Reputation: 45
I'm training a very simple model on a random set of numbers, just trying to learn y=x. The code is below, but also here: https://pastebin.com/6cNdjNNF . However, the model sometimes behaves abnormally: it doesn't train and instead reaches a plateau. I should note that I seed the random list of 10000 values for x, and y=x, so the data is identical in every single run.
import numpy as np
import pandas as pd
import tensorflow as tf

prng = np.random.RandomState(1234567891)
x = prng.rand(10000, 1)
y = x

def create_model():
    dropout_nodes = 0.0
    intermediary_activation = 'relu'
    final_activation = 'linear'
    # initialize sequential model
    model = tf.keras.models.Sequential()
    layer_nodes = [16, 8, 4, 2]
    for i, layer_node in enumerate(layer_nodes):
        if i == 0:
            # first layer
            model.add(tf.keras.layers.Dense(layer_node, input_dim=1))
            model.add(tf.keras.layers.Activation(intermediary_activation))
            model.add(tf.keras.layers.Dropout(dropout_nodes))
            # model.add(tf.keras.layers.BatchNormalization())
        else:
            # other layers
            model.add(tf.keras.layers.Dense(layer_node))
            model.add(tf.keras.layers.Activation(intermediary_activation))
            model.add(tf.keras.layers.Dropout(dropout_nodes))
            # model.add(tf.keras.layers.BatchNormalization())

    model.add(tf.keras.layers.Dense(1))
    model.add(tf.keras.layers.Activation(final_activation))

    loss = 'mse'
    metric = ["mae", "mape"]
    opt = tf.keras.optimizers.SGD(learning_rate=1e-2)
    # opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
    model.compile(loss=loss, optimizer=opt, metrics=[metric])
    return model

model = create_model()
history = model.fit(x=x, y=y,
                    validation_split=0.1, shuffle=False,
                    epochs=20,
                    batch_size=32,
                    verbose=1)

pred = model.predict(x)
df = pd.DataFrame(x, columns=["x"])
df['y'] = y
df['pred'] = pred

model_evaluation = model.evaluate(x, y, verbose=2)
dict_model_evaluation = {k.name: model_evaluation[i] for i, k in enumerate(model.metrics)}
print(dict_model_evaluation)
Specifically, when printing the final evaluation, I get the following results from running the script 10 times. Notice that on five of the runs the results are identical; looking at the per-epoch output of one of those runs (shown below the results), the model reaches a plateau almost immediately and never improves. Why would this happen?
{'loss': 0.08206459134817123, 'mae': 0.24807175993919373, 'mape': 797.3375854492188}
{'loss': 4.3269268644507974e-05, 'mae': 0.0054251449182629585, 'mape': 33.66191101074219}
{'loss': 3.115053550573066e-05, 'mae': 0.003888161387294531, 'mape': 47.37348937988281}
{'loss': 0.08206459134817123, 'mae': 0.24807175993919373, 'mape': 797.3375854492188}
{'loss': 0.08206459134817123, 'mae': 0.24807175993919373, 'mape': 797.3375854492188}
{'loss': 0.08206459134817123, 'mae': 0.24807175993919373, 'mape': 797.3375854492188}
{'loss': 5.879357559024356e-06, 'mae': 0.0013944993261247873, 'mape': 23.40262794494629}
{'loss': 0.08206459134817123, 'mae': 0.24807175993919373, 'mape': 797.3375854492188}
{'loss': 6.495025900221663e-06, 'mae': 0.0019656901713460684, 'mape': 20.390905380249023}
{'loss': 1.061584316630615e-05, 'mae': 0.0014895511558279395, 'mape': 38.272361755371094}
Epoch 1/20
282/282 [==============================] - 1s 3ms/step - loss: 0.1051 - mae: 0.2714 - mape: 645.1287 - val_loss: 0.0807 - val_mae: 0.2468 - val_mape: 481.0587
Epoch 2/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 821.4997 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6515
Epoch 3/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1142 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 4/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 5/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 6/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 7/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 8/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 9/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 10/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 11/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 12/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 13/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 14/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 15/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 16/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 17/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 18/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 19/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Epoch 20/20
282/282 [==============================] - 1s 2ms/step - loss: 0.0822 - mae: 0.2483 - mape: 822.1162 - val_loss: 0.0806 - val_mae: 0.2468 - val_mape: 482.6568
Upvotes: 2
Views: 764
Reputation: 71
When a model hits a plateau like this, you will notice that it actually just predicts the same number no matter what input you give it. This occasionally happens because the activation function causes the model to zero out. In the runs where the model plateaus, if you look at the output of the 2-node bottleneck layer (the second-to-last Dense layer), you will see that it looks like this:
array([[0., 0.]], dtype=float32)
It will look like this for any number you pass to the model. How did it get there? Look at the weights of the layer with only 2 nodes; on a run where the model plateaued, they looked like this:
[array([[-0.20621395, -0.06383181],
[-0.7566335 , -0.67413807],
[ 0.89420843, -0.17675757],
[-0.9511714 , -0.27772212]], dtype=float32),
array([0., 0.], dtype=float32)]
You will notice that the majority of the weights are negative. Since the inputs to this layer come out of the previous ReLU and are therefore non-negative, the pre-activations of both of the 2 nodes end up negative, and because of the way ReLU works anything negative becomes zero. So two zeros are always passed to the last layer; the model does its best to guess the output from that constant input, but it obviously cannot do well, and it settles on a value near the middle of the data points, in this case close to 0.5.
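To make that mechanism concrete, here is a minimal sketch (independent of the model above; the weight values are made up purely for illustration) showing how all-negative pre-activations are zeroed by ReLU, so the final layer can only output a constant:
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Made-up 2-node bottleneck whose weights are all negative, with zero biases.
W = np.array([[-0.2, -0.1],
              [-0.7, -0.6]])
b = np.zeros(2)

# Made-up final layer: some weights plus a bias near the middle of the targets.
w_out, b_out = np.array([0.4, -0.3]), 0.5

for x_in in (np.array([0.1, 0.9]), np.array([0.5, 0.3])):
    h = relu(x_in @ W + b)        # non-negative inputs x negative weights -> [0., 0.]
    print(h, h @ w_out + b_out)   # the prediction collapses to the bias, 0.5, every time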
The reason this happens on some runs and not others is the random initialization of the weights: an unlucky draw with a lot of negative values in the 2-node layer is much more likely to cause the plateauing effect.
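As a side note on why your NumPy seed doesn't prevent this: it only fixes the data, while Keras draws the initial weights from TensorFlow's own random generator. If you wanted every run to start from the same weights, you would also need to seed that, roughly like this (the seed value 42 is arbitrary):
import tensorflow as tf

tf.random.set_seed(42)   # makes the Keras weight initialization reproducible
model = create_model()   # every run now starts from the same initial weights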
As for fixing the issue, the best idea is not to bottleneck the model so much; stick with higher node counts in each dense layer, for example:
layer_nodes = [32, 16, 8]
Then the likelihood of a layer's output being all zeros is much lower, and running the model a few times with this new layer count I never saw the same plateauing effect. There are other possible fixes too, such as changing the weight initialization or using a different activation function, but I think increasing the nodes per layer is the simplest.
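For completeness, here is a sketch of what those alternatives could look like (the specific choices of LeakyReLU and 'he_normal' are just examples, not the only options): a leaky activation never outputs exactly zero for negative inputs, and an explicit initializer controls how the starting weights are drawn.
import tensorflow as tf

# Illustrative variant of the question's model: same layer sizes, but with
# LeakyReLU hidden activations and He-normal weight initialization.
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(16, input_dim=1, kernel_initializer='he_normal'))
model.add(tf.keras.layers.LeakyReLU())
for layer_node in [8, 4, 2]:
    model.add(tf.keras.layers.Dense(layer_node, kernel_initializer='he_normal'))
    model.add(tf.keras.layers.LeakyReLU())
model.add(tf.keras.layers.Dense(1))
model.compile(loss='mse', optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2))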
Some other helpful code. This is how I viewed the weights of the 2-node Dense layer (with the question's model it sits at index -5, since it is followed by an Activation, a Dropout, the final Dense and its Activation):
model.get_layer(index=-5).get_weights()  # 2-node Dense: kernel of shape (4, 2) plus bias
And this is how to view the output of the 2-node bottleneck after its ReLU, where the zeroing occurs; .34 is just a random value I tested with:
model2 = tf.keras.Model(inputs=model.input, outputs=model.get_layer(index=-4).output)  # ReLU output of the 2-node layer
model2.predict([[.34]])
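To check that the bottleneck is dead for every input rather than just one test value, you can run a spread of inputs through model2 and count the zero activations (a small follow-up sketch using the model2 defined above):
import numpy as np

probe = np.linspace(0.0, 1.0, 101).reshape(-1, 1)   # inputs covering the data range
bottleneck_out = model2.predict(probe, verbose=0)
# On a plateaued run this prints 1.0: every activation of the 2-node layer is zero.
print("fraction of zero activations:", np.mean(bottleneck_out == 0.0))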
Upvotes: 1