Reputation: 73
I just trained a three-layer softmax neural network with TensorFlow. It is from Andrew Ng's course, 3.11 TensorFlow. I modified the code so that I can see the train and test accuracy in each epoch.
When I increase the learning rate, the cost stays around 1.9 and the accuracy stays unchanged at 0.1666...7 (i.e. 1/6, with 6 output classes). I find that the higher the learning rate, the more frequently this happens. When the learning_rate is around 0.001, this situation sometimes happens. When the learning_rate is around 0.0001, this situation does not happen.
I just want to know why.
This is some output data:
learning_rate = 1
Cost after epoch 0: 1312.153492
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 100: 1.918554
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 200: 1.897831
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 300: 1.907957
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 400: 1.893983
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 500: 1.920801
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
learning_rate = 0.01
Cost after epoch 0: 2.906999
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 100: 1.847423
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 200: 1.847042
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 300: 1.847402
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 400: 1.847197
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
Cost after epoch 500: 1.847694
Train Accuracy: 0.16666667
Test Accuracy: 0.16666667
This is the code:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf                         # TensorFlow 1.x
from tensorflow.python.framework import ops

# create_placeholders, initialize_parameters, forward_propagation, compute_cost
# and random_mini_batches are helper functions defined earlier in the assignment.

def model(X_train, Y_train, X_test, Y_test, learning_rate=0.0001,
          num_epochs=1500, minibatch_size=32, print_cost=True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.

    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
    X_test -- test set, of shape (input size = 12288, number of test examples = 120)
    Y_test -- test labels, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    ops.reset_default_graph()    # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)        # to keep consistent results
    seed = 3                     # to keep consistent results
    (n_x, m) = X_train.shape     # n_x: input size, m: number of examples in the train set
    n_y = Y_train.shape[0]       # n_y: output size
    costs = []                   # to keep track of the cost

    # Create placeholders of shape (n_x, n_y)
    X, Y = create_placeholders(n_x, n_y)

    # Initialize parameters
    parameters = initialize_parameters()

    # Forward propagation: build the forward propagation in the tensorflow graph
    Z3 = forward_propagation(X, parameters)

    # Cost function: add the cost function to the tensorflow graph
    cost = compute_cost(Z3, Y)

    # Backpropagation: define the tensorflow optimizer. Use an AdamOptimizer.
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Initialize all the variables
    init = tf.global_variables_initializer()

    # Calculate the correct predictions
    correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

    # Calculate accuracy on the test set
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:

        # Run the initialization
        sess.run(init)

        # Do the training loop
        for epoch in range(num_epochs):

            epoch_cost = 0.                              # defines a cost related to an epoch
            num_minibatches = int(m / minibatch_size)    # number of minibatches of size minibatch_size in the train set
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch

                # Run the session to execute the "optimizer" and the "cost";
                # the feed dict contains the minibatch for (X, Y).
                _, minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y})

                epoch_cost += minibatch_cost / num_minibatches

            # Print the cost and the current train/test accuracy every 100 epochs
            if print_cost and epoch % 100 == 0:
                print("Cost after epoch %i: %f" % (epoch, epoch_cost))
                print("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
                print("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))
            if print_cost and epoch % 5 == 0:
                costs.append(epoch_cost)

        # Plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per tens)')
        plt.title("Learning rate = " + str(learning_rate))
        plt.show()

        # Save the parameters in a variable
        parameters = sess.run(parameters)
        print("Parameters have been trained!")

        print("Train Accuracy:", accuracy.eval({X: X_train, Y: Y_train}))
        print("Test Accuracy:", accuracy.eval({X: X_test, Y: Y_test}))

        return parameters

parameters = model(X_train, Y_train, X_test, Y_test, learning_rate=0.001)
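For reference, this is roughly how I compare the different learning rates; a minimal sketch that assumes the model function and the dataset variables above are already defined:

# Rough sketch of the learning-rate comparison (values are the ones mentioned above).
for lr in [1, 0.01, 0.001, 0.0001]:
    print("learning_rate =", lr)
    parameters = model(X_train, Y_train, X_test, Y_test, learning_rate=lr)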
Upvotes: 1
Views: 1388
Reputation: 11508
Reading the other answers, I'm still not quite satisfied with a few points, especially since I feel this issue can be (and has been) visualized nicely, which touches on the arguments made here.
Firstly, I agree with most of what @Shubham Panchal mentioned in his answer, and he mentions some reasonable starting values:
A high learning rate will usually not end in convergence, but will instead bounce around the solution indefinitely.
A learning rate that is too small will generally yield a very slow convergence, and you might do a lot of "extra work".
This is visualized in this infographic (ignore the parameter), for a 2D parameter space: [image not reproduced here]
Your problem is likely due to something similar to the right depiction. Furthermore, and this is something that has not been mentioned so far, the optimal learning rate (if there even is such a thing) largely depends on your specific problem setting; for my problems, smooth convergence might occur with a learning rate that is orders of magnitude different from yours. It (unfortunately) also makes sense to just try out a few values to narrow down where you can achieve a reasonable result, i.e. what you did in your post.
Furthermore, we can also address possible solutions to this problem. One neat trick I like to apply to my models is to reduce the learning rate every now and then. There are different implementations of this available in most frameworks:
Keras: the LearningRateScheduler callback
PyTorch: set the rate manually, e.g. optimizer.param_groups[0]['lr'] = new_value
In short, the idea is to start with a relatively high learning rate (I still prefer values between 0.01 and 0.1 to start out with), and then gradually reduce it to make sure you eventually end up in a local minimum.
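For instance, in the TF1-style code from the question, the same idea could look roughly like the sketch below, using tf.train.exponential_decay (not one of the two implementations listed above, but the same mechanism; the starting rate and decay values are illustrative, not tuned):

# Minimal TF1-style sketch: decay the learning rate as training progresses.
# global_step is incremented by the optimizer on every minibatch update.
global_step = tf.Variable(0, trainable=False, name="global_step")

# Start relatively high (0.01) and multiply the rate by 0.95 every 500 updates.
decayed_lr = tf.train.exponential_decay(learning_rate=0.01,
                                        global_step=global_step,
                                        decay_steps=500,
                                        decay_rate=0.95,
                                        staircase=True)

# "cost" is the cost tensor built in the question's model() function.
optimizer = tf.train.AdamOptimizer(decayed_lr).minimize(cost, global_step=global_step)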
Also note that there is a whole field of research on the topic of non-convex optimization, i.e. how to make sure that you end up with the "best possible" solution and not just get stuck in a local minimum. But I think this would be out of scope for now.
Upvotes: 1
Reputation:
In terms of gradient descent,
Upvotes: 3