Frank

Reputation: 73

High learning rate makes model training fail

I just trained a three-layer softmax neural network with TensorFlow. It is from Andrew Ng's course, 3.11 TensorFlow. I modified the code so that I can see the train and test accuracy in each epoch.

When I increase the learning rate, the cost stays around 1.9 and the accuracy stays unchanged at 0.1666...7 (that is 1/6, chance level for the 6 classes). I find that the higher the learning rate, the more frequently this happens. When the learning_rate is around 0.001, this situation sometimes happens. When the learning_rate is around 0.0001, it does not happen.

I just want to know why.

This is some output data:

learning_rate = 1
Cost after epoch 0: 1312.153492
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 100: 1.918554
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 200: 1.897831
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 300: 1.907957
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 400: 1.893983
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 500: 1.920801
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667

learning_rate = 0.01
Cost after epoch 0: 2.906999
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 100: 1.847423
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 200: 1.847042
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 300: 1.847402
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 400: 1.847197
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 500: 1.847694
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667

This is the code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
# (create_placeholders, initialize_parameters, forward_propagation, compute_cost
#  and random_mini_batches are helper functions defined earlier in the course notebook)

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.0001,
          num_epochs = 1500, minibatch_size = 32, print_cost = True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.

    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
    X_test -- test set, of shape (input size = 12288, number of test examples = 120)
    Y_test -- test labels, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep consistent results
    seed = 3                                          # to keep consistent results
    (n_x, m) = X_train.shape                          # (n_x: input size, m : number of examples in the train set)
    n_y = Y_train.shape[0]                            # n_y : output size
    costs = []                                        # To keep track of the cost

    # Create Placeholders of shape (n_x, n_y)
    ### START CODE HERE ### (1 line)
    X, Y = create_placeholders(n_x, n_y)
    ### END CODE HERE ###

    # Initialize parameters
    ### START CODE HERE ### (1 line)
    parameters = initialize_parameters()
    ### END CODE HERE ###

    # Forward propagation: Build the forward propagation in the tensorflow graph
    ### START CODE HERE ### (1 line)
    Z3 = forward_propagation(X, parameters)
    ### END CODE HERE ###

    # Cost function: Add cost function to tensorflow graph
    ### START CODE HERE ### (1 line)
    cost = compute_cost(Z3, Y)
    ### END CODE HERE ###

    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer.
    ### START CODE HERE ### (1 line)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    ### END CODE HERE ###

    # Initialize all the variables
    init = tf.global_variables_initializer()
    # Calculate the correct predictions
    correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

    # Calculate accuracy on the test set
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:

        # Run the initialization
        sess.run(init)

        # Do the training loop
        for epoch in range(num_epochs):

            epoch_cost = 0.                       # Defines a cost related to an epoch
            num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch

                # IMPORTANT: The line that runs the graph on a minibatch.
                # Run the session to execute the "optimizer" and the "cost"; the feed_dict should contain a minibatch for (X, Y).
                ### START CODE HERE ### (1 line)
                _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})
                ### END CODE HERE ###

                epoch_cost += minibatch_cost / num_minibatches

            # Print the cost every epoch
            if print_cost and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
                print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
                print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
            if print_cost and epoch % 5 == 0:
                costs.append(epoch_cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs (per fives)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # let's save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")


        print ("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        print ("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))

        return parameters

parameters = model(X_train, Y_train, X_test, Y_test, learning_rate=0.001)

Upvotes: 1

Views: 1388

Answers (2)

dennlinger

Reputation: 11508

Reading the other answers, I'm still not quite satisfied with a few points, especially since I feel this issue can be (and has been) visualized nicely, which touches on the arguments made here.

Firstly, I agree with most of what @Shubham Panchal mentioned in his answer, and he gives some reasonable starting values:

  • A learning rate that is too high will usually not end in convergence, but will instead bounce around the solution indefinitely.
  • A learning rate that is too small will generally yield very slow convergence, and you might do a lot of "extra work". This is visualized in this infographic (ignore the parameter), for a 2D parameter space: [image: gradient descent with different parameters]

Your problem is likely due to something similar to the right-hand depiction. Furthermore, and this is something that has not been mentioned so far, the optimal learning rate (if there even is such a thing) largely depends on your specific problem setting; for my problems, smooth convergence might come with a learning rate that is orders of magnitude different from yours. It (unfortunately) also makes sense to just try out a few values to narrow down where you can achieve a reasonable result, i.e. what you did in your post.
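
For instance, a simple sweep over a few candidate values, reusing the model() function from the question, could look like the following sketch (the candidate rates and the shortened epoch count are just example choices):

for lr in [0.0001, 0.001, 0.01, 0.1]:
    print("Trying learning_rate =", lr)
    # shorter run per candidate, just to compare the trend of cost and accuracy
    parameters = model(X_train, Y_train, X_test, Y_test, learning_rate=lr, num_epochs=500)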

Furthermore, we can also address possible solutions to this problem. One neat trick I like to apply to my models is to reduce the learning rate every now and then. There are different implementations of this available in most frameworks:

  • Keras allows you to set the learning rate with a callback called LearningRateScheduler (a short sketch follows this list).
  • PyTorch allows you to directly manipulate the learning rate like so: optimizer.param_groups[0]['lr'] = new_value.
  • TensorFlow has multiple functions (e.g. tf.train.exponential_decay) that allow you to decay the learning rate accordingly (see the sketch after the next paragraph).
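
As referenced above, a minimal Keras sketch of the LearningRateScheduler callback could look like this; the schedule values are arbitrary, and "model" here stands for any compiled Keras model (not the model() function from the question):

import tensorflow as tf  # assuming a TF version that ships tf.keras; standalone keras works the same way

def schedule(epoch):
    # start at 0.01 and halve the learning rate every 100 epochs (example values only)
    return 0.01 * (0.5 ** (epoch // 100))

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule)
# model.fit(x_train, y_train, epochs=500, callbacks=[lr_callback])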

In short, the idea is to start with a relatively high learning rate (I still prefer values between 0.01 and 0.1 to start with), and then gradually reduce it to make sure you eventually end up in a local minimum.
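
For the TensorFlow 1.x code in the question, a minimal sketch of this idea could use tf.train.exponential_decay; the decay values below are placeholders, and "cost" refers to the cost tensor from the question's code:

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,       # relatively high starting value
    global_step=global_step,  # incremented once per optimizer step
    decay_steps=1000,         # every 1000 steps ...
    decay_rate=0.9,           # ... multiply the rate by 0.9
    staircase=True)

# passing global_step makes minimize() increment it automatically
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost, global_step=global_step)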

Also note that there is a whole field of research on the topic of non-convex optimization, i.e. how to make sure that you end up with the "best possible" solution rather than just getting stuck in a local minimum. But I think this would be out of scope for now.

Upvotes: 1

user9477964

Reputation:

In terms of gradient descent,

  1. Higher learning rates like 1.0 and 1.5 make the optimizer take bigger steps towards the minima of the loss function. If the learning rate is 1, the change in the weights is larger. Because of these bigger steps, the optimizer sometimes skips over the minimum, and the loss begins to increase again.
  2. Lower learning rates like 0.001 and 0.01 are usually better. Here, the weight update is scaled down by a factor of 100 or 1000, making each step smaller. As a result, the optimizer takes smaller steps towards the minimum and does not skip over it so easily (see the small numeric sketch after this list).
  3. Higher learning rates make the model converge faster but may skip the minima. Lower learning rates take a long time to converge but tend to converge more reliably.
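
As a tiny numeric illustration of points 1 and 2, here is one plain gradient-descent step, w_new = w - learning_rate * gradient, with made-up weight and gradient values:

w, gradient = 0.5, 2.0  # made-up values, only for illustration

for lr in (1.0, 0.01, 0.0001):
    step = lr * gradient
    print("lr=%g  step size=%g  new w=%g" % (lr, step, w - step))

# lr=1       step size=2       new w=-1.5    (huge jump, can overshoot the minimum)
# lr=0.01    step size=0.02    new w=0.48
# lr=0.0001  step size=0.0002  new w=0.4998  (tiny step, very slow progress)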

Upvotes: 3
