Reputation: 1587
I'm trying to train a convolutional neural network using AdamOptimizer (the model is inspired by VGG-16 and is listed at the end of the question). The network produces image embeddings (it transforms an image into a list of 128 values).
Until now I've used 0.0001 as the learning rate in all my experiments, which gave me normal values for loss and accuracy.
Everything goes insane when I use large learning rates such as 0.1 or 0.01.
I get results like:
epoch 0 loss 0.19993 acc 57.42
nr_test_examples 512
total_batch_test 1
TEST epoch 0 loss 5313544259158016.00000 acc 58.20
Test Aitor nr poze in plus 751
epoch 1 loss 20684906328883200.00000 acc 0.00
nr_test_examples 512
total_batch_test 1
TEST epoch 1 loss 1135694416986112.00000 acc 51.56
Test Aitor nr poze in plus 1963
epoch 2 loss 2697752092246016.00000 acc 0.00
nr_test_examples 512
total_batch_test 1
TEST epoch 2 loss 53017830782.00000 acc 52.73
Test Aitor nr poze in plus 1977
epoch 3 loss 128667078418.00000 acc 0.00
nr_test_examples 512
total_batch_test 1
TEST epoch 3 loss 757709846097920.00000 acc 52.34
The magnitude of the embedding values returned by the model increases with the learning rate.
For learning rate 0.1:
[[ 1.29028062e+22 2.76679972e+22 -1.60350428e+22 -2.59803047e+22
-7.18799158e+21 3.79426737e+22 6.16485875e+21 5.25694511e+22
1.88533167e+22 2.83884797e+21 8.02921163e+21 -9.36909501e+21
-1.44595632e+22 -2.42238243e+22 2.02972577e+21 1.05234577e+22
-1.80612585e+22 -4.78811634e+22 1.49373501e+22 5.06000855e+22
3.70631387e+22 1.84049113e+22 -3.99712842e+22 3.87442379e+22
1.75347753e+22 5.92351884e+22 -3.53815667e+22 -1.82951788e+22
-6.43566909e+22 2.47560282e+22 5.30715552e+21 1.83587696e+22
-7.92202990e+21 1.67361902e+22 8.59540559e+20 -3.81585403e+22
-1.21638398e+22 4.17503997e+22 -1.22125473e+22 2.79304332e+22
-4.56848209e+22 1.57062125e+22 -2.50028311e+21 -2.62136002e+22
4.54086438e+21 -1.56374639e+22 -9.88864603e+21 -4.41802088e+22
-1.34634863e+22 5.70279618e+21 2.03487718e+22 -2.43145786e+22
3.17775273e+22 -1.20715622e+22 2.58878188e+22 5.10632087e+22
4.19953009e+22 3.96467818e+22 -1.04965802e+22 3.02379628e+22
-5.25661860e+22 3.07441015e+21 -5.18819518e+21 2.95340929e+22
1.14506092e+22 1.15907500e+22 6.69119500e+21 3.77412660e+22
-3.94501085e+21 1.33659958e+22 -1.60639323e+22 4.13619597e+22
2.68251817e+21 6.45229424e+21 -2.73042746e+21 4.42164447e+22
2.80798401e+22 -1.88889266e+22 4.13956748e+21 3.89647612e+21
-3.97987648e+22 3.42041704e+22 -7.92604683e+20 6.57421467e+22
-8.36352284e+21 -3.10638036e+22 4.72475508e+21 -1.85049497e+22
-2.01018620e+22 -4.16415747e+22 -1.26361030e+22 3.21139147e+22
9.59236321e+21 1.88358765e+22 -1.30287966e+22 -7.88201598e+21
3.74658596e+22 -1.73451794e+22 3.64240847e+22 3.83275750e+21
3.18538926e+22 -2.88709469e+22 -3.58837879e+22 -8.98292556e+20
1.61682176e+22 -4.03502305e+22 1.66714803e+22 -1.75002721e+22
1.72512196e+22 1.00159954e+22 1.31722408e+22 -6.84561825e+22
1.55648918e+22 1.01815039e+22 2.80281495e+21 2.46405536e+22
-3.38236179e+21 -4.50928036e+21 -3.56030898e+22 3.63372148e+22
-2.91085715e+21 1.96335417e+22 -9.57801362e+21 4.60519886e+21
2.86536550e+22 3.00846580e+22 8.66609606e+21 8.57120803e+21]]
For learning rate 0.01:
[[ 135379.078125 427807.0625 -211165.5 -270527.875
263263.46875 61203.9765625 243880.703125 -134595.53125
65044.28125 -133903.921875 -326986.875 -346536.375 349003.
-138743.328125 440702.1875 -108623.6484375 73725.84375
-140035.90625 -357855.75 338021.65625 247224.15625
-85308.8515625 -511153.90625 206612.296875 -317970.0625
-95346.1796875 -24617.36523438 -369452.21875 -477215.0625
-154431.234375 281639.625 -387593.4375 96041.2109375
-184906.59375 107803.296875 74392.546875 463264.78125
239308.84375 743635.375 -40640.921875 6956.1953125
284925.75 -649819.3125 -295953.34375 38507.95703125
35773.08984375 214856.546875 -289618.78125 381939.90625
-68496.5546875 418068.46875 627032.625 182973.40625
119805.296875 14911.890625 475292.40625 -265693.125
-416467.28125 -354252.125 -162428.90625 336221.15625
41771.5625 -395673.09375 149899.5 -86771.7421875
-84667.2890625 -299950.8125 537230.5625 -138381.921875
294517.21875 92734.6015625 26118.45898438 380978.34375
-524781.9375 -150150.921875 563931.875 212278.8125
-156267.859375 -7298.81445312 -546963.125 155122.828125
-41295.8359375 46307.93359375 -128129.0546875 36079.36328125
-460227.65625 123968.7734375 728651.4375 252526.984375
-126041.7734375 265436. -74924.703125 244991.8125
38667.71875 -29434.65429688 374994.15625 -146754.859375
180715.015625 95923.5078125 479208.21875 333908.5
132672.703125 -402727.09375 -425125.03125 -68114.1640625
122268.4375 -308014.96875 473961.40625 370820.125
-502812.3125 201727.015625 156381.46875 337941.125
-291394.9375 273098.71875 -91102.7421875 64342.390625
-316238.625 291803.21875 -413403.4375 207456.203125
106696.90625 -274239.90625 266393.65625 50893.91015625
149943.265625 -100018.765625 -283917.65625 ]]
For learning rate 0.0001 I get small values.
I'm accumulating gradients, because I don't have enough resources to pass large batches through at once:
tvs = tf.trainable_variables()  # retrieve all trainable variables defined in the graph
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.01
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 25600, 0.96, staircase=True)
opt = tf.train.AdamOptimizer(learning_rate)
# Create non-trainable variables with the same shapes as the trainable ones,
# initialized with zeros, to hold the accumulated gradients.
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
zero_ops = [av.assign(tf.zeros_like(av)) for av in accum_vars]
# Call the optimizer's compute_gradients to obtain the list of (gradient, variable) pairs.
gvs = opt.compute_gradients(cost, tvs)
# Add each gradient to its accumulator (works because accum_vars and gvs are in the same order).
accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]
train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)], global_step=global_step)
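For reference, a minimal sketch of how these ops are meant to be driven on each training step (sess, x_ph, y_ph, next_minibatch and n_accum are placeholder names for illustration, not part of the code above):
sess.run(zero_ops)                     # reset the gradient accumulators to zero
for _ in range(n_accum):               # accumulate over several small mini-batches
    batch_x, batch_y = next_minibatch()
    sess.run(accum_ops, feed_dict={x_ph: batch_x, y_ph: batch_y})
sess.run(train_step)                   # apply the summed (not averaged) gradients once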
Why does this happen?
Here is my model:
def siamese_convnet(x):
    w_conv1_1 = tf.get_variable(name='w_conv1_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 1, 64])
    w_conv1_2 = tf.get_variable(name='w_conv1_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 64, 64])
    w_conv2_1 = tf.get_variable(name='w_conv2_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 64, 128])
    w_conv2_2 = tf.get_variable(name='w_conv2_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 128, 128])
    w_conv3_1 = tf.get_variable(name='w_conv3_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 128, 256])
    w_conv3_2 = tf.get_variable(name='w_conv3_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 256])
    w_conv3_3 = tf.get_variable(name='w_conv3_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 256])
    w_conv4_1 = tf.get_variable(name='w_conv4_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 512])
    w_conv4_2 = tf.get_variable(name='w_conv4_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv4_3 = tf.get_variable(name='w_conv4_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[1, 1, 512, 512])
    w_conv5_1 = tf.get_variable(name='w_conv5_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv5_2 = tf.get_variable(name='w_conv5_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv5_3 = tf.get_variable(name='w_conv5_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[1, 1, 512, 512])

    w_fc_1 = tf.get_variable(name='w_fc_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[5*5*512, 2048])
    w_fc_2 = tf.get_variable(name='w_fc_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[2048, 1024])
    w_out = tf.get_variable(name='w_out', initializer=tf.contrib.layers.xavier_initializer(), shape=[1024, 128])

    bias_conv1_1 = tf.get_variable(name='bias_conv1_1', initializer=tf.constant(0.01, shape=[64]))
    bias_conv1_2 = tf.get_variable(name='bias_conv1_2', initializer=tf.constant(0.01, shape=[64]))
    bias_conv2_1 = tf.get_variable(name='bias_conv2_1', initializer=tf.constant(0.01, shape=[128]))
    bias_conv2_2 = tf.get_variable(name='bias_conv2_2', initializer=tf.constant(0.01, shape=[128]))
    bias_conv3_1 = tf.get_variable(name='bias_conv3_1', initializer=tf.constant(0.01, shape=[256]))
    bias_conv3_2 = tf.get_variable(name='bias_conv3_2', initializer=tf.constant(0.01, shape=[256]))
    bias_conv3_3 = tf.get_variable(name='bias_conv3_3', initializer=tf.constant(0.01, shape=[256]))
    bias_conv4_1 = tf.get_variable(name='bias_conv4_1', initializer=tf.constant(0.01, shape=[512]))
    bias_conv4_2 = tf.get_variable(name='bias_conv4_2', initializer=tf.constant(0.01, shape=[512]))
    bias_conv4_3 = tf.get_variable(name='bias_conv4_3', initializer=tf.constant(0.01, shape=[512]))
    bias_conv5_1 = tf.get_variable(name='bias_conv5_1', initializer=tf.constant(0.01, shape=[512]))
    bias_conv5_2 = tf.get_variable(name='bias_conv5_2', initializer=tf.constant(0.01, shape=[512]))
    bias_conv5_3 = tf.get_variable(name='bias_conv5_3', initializer=tf.constant(0.01, shape=[512]))
    bias_fc_1 = tf.get_variable(name='bias_fc_1', initializer=tf.constant(0.01, shape=[2048]))
    bias_fc_2 = tf.get_variable(name='bias_fc_2', initializer=tf.constant(0.01, shape=[1024]))
    # bias_fc = tf.get_variable(name='bias_fc', initializer=tf.zeros([1024]))
    out = tf.get_variable(name='out', initializer=tf.constant(0.01, shape=[128]))

    x = tf.reshape(x, [-1, 160, 160, 1])

    conv1_1 = tf.nn.relu(conv2d(x, w_conv1_1) + bias_conv1_1)
    conv1_2 = tf.nn.relu(conv2d(conv1_1, w_conv1_2) + bias_conv1_2)
    max_pool1 = max_pool(conv1_2)
    #max_pool1 = tf.nn.dropout(max_pool1, keep_rate)

    conv2_1 = tf.nn.relu(conv2d(max_pool1, w_conv2_1) + bias_conv2_1)
    conv2_2 = tf.nn.relu(conv2d(conv2_1, w_conv2_2) + bias_conv2_2)
    max_pool2 = max_pool(conv2_2)
    #max_pool2 = tf.nn.dropout(max_pool2, keep_rate)

    conv3_1 = tf.nn.relu(conv2d(max_pool2, w_conv3_1) + bias_conv3_1)
    conv3_2 = tf.nn.relu(conv2d(conv3_1, w_conv3_2) + bias_conv3_2)
    conv3_3 = tf.nn.relu(conv2d(conv3_2, w_conv3_3) + bias_conv3_3)
    max_pool3 = max_pool(conv3_3)
    #max_pool3 = tf.nn.dropout(max_pool3, keep_rate)

    conv4_1 = tf.nn.relu(conv2d(max_pool3, w_conv4_1) + bias_conv4_1)
    conv4_2 = tf.nn.relu(conv2d(conv4_1, w_conv4_2) + bias_conv4_2)
    conv4_3 = tf.nn.relu(conv2d(conv4_2, w_conv4_3) + bias_conv4_3)
    max_pool4 = max_pool(conv4_3)
    #max_pool4 = tf.nn.dropout(max_pool4, keep_rate)

    conv5_1 = tf.nn.relu(conv2d(max_pool4, w_conv5_1) + bias_conv5_1)
    conv5_2 = tf.nn.relu(conv2d(conv5_1, w_conv5_2) + bias_conv5_2)
    conv5_3 = tf.nn.relu(conv2d(conv5_2, w_conv5_3) + bias_conv5_3)
    max_pool5 = max_pool(conv5_3)
    #max_pool5 = tf.nn.dropout(max_pool5, keep_rate)

    fc_helper = tf.reshape(max_pool5, [-1, 5*5*512])
    fc_1 = tf.nn.relu(tf.matmul(fc_helper, w_fc_1) + bias_fc_1)
    #fc_1 = tf.nn.dropout(fc_1, keep_rate)
    fc_2 = tf.nn.relu(tf.matmul(fc_1, w_fc_2) + bias_fc_2)
    #fc_2 = tf.nn.dropout(fc_2, 0.7)

    # fc = tf.nn.relu(tf.matmul(fc_1, fc_layer) + bias_fc)
    # fc = tf.nn.dropout(fc, keep_rate)
    # output = tf.matmul(fc, w_out) + out

    output = tf.matmul(fc_2, w_out) + out
    #output = tf.nn.l2_normalize(output, 0)

    return output
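conv2d and max_pool are helpers that aren't included above. A typical definition consistent with the 5*5*512 flatten size for 160x160 inputs would be (an assumed sketch, not necessarily the exact helpers used):
def conv2d(x, W):
    # stride-1 convolution with 'SAME' padding, so the spatial size is preserved
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool(x):
    # 2x2 max pooling with stride 2, halving the spatial size (160 -> 5 after five pools)
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')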
Upvotes: 0
Views: 81
Reputation: 824
Backpropagation computes the gradients that gradient descent uses to decide how much to change the weights in your model. These changes are scaled by the learning rate: when the learning rate is large, the weight updates are also large. If the learning rate is too large, a single update can throw the weights far away from where they were, and the cost function will probably get significantly worse. The learning rate can also be too small, in which case the model takes a long time to improve.
In general, if your costs are blowing up, your learning rate is probably too high.
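As a toy illustration, here is plain gradient descent on f(w) = w**2 (not your network, so the exact threshold differs, but the qualitative behaviour is the same): a small step size shrinks w toward the minimum, while a step size that is too large makes every update overshoot, and |w| grows without bound.
# Gradient descent on f(w) = w**2, whose gradient is 2*w.
def descend(lr, w=1.0, steps=50):
    for _ in range(steps):
        w = w - lr * 2 * w   # w <- w - learning_rate * gradient
    return w

print(descend(0.1))  # ~1.4e-05: converging toward 0
print(descend(1.1))  # ~9.1e+03: diverging, the same pattern as your exploding loss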
Upvotes: 2