Hello Lili

Reputation: 1587

Larger learning rates lead to large weights

I'm trying to train a convolutional neural network using AdamOptimizer (the model is inspired by VGG-16 and is listed at the end of the question). The network produces image embeddings (it transforms an image into a list of 128 values).

Until now, I've used 0.0001 as the learning rate in all my experiments (which gave me normal values for loss and accuracy).

Everything blows up when I use larger learning rates such as 0.1 or 0.01.

I get results like:

epoch 0  loss 0.19993 acc 57.42 
nr_test_examples 512
total_batch_test 1
TEST epoch 0  loss 5313544259158016.00000 acc 58.20 
Test Aitor nr poze in plus 751
epoch 1  loss 20684906328883200.00000 acc 0.00 
nr_test_examples 512
total_batch_test 1
TEST epoch 1  loss 1135694416986112.00000 acc 51.56 
Test Aitor nr poze in plus 1963
epoch 2  loss 2697752092246016.00000 acc 0.00 
nr_test_examples 512
total_batch_test 1
TEST epoch 2  loss 53017830782.00000 acc 52.73 
Test Aitor nr poze in plus 1977
epoch 3  loss 128667078418.00000 acc 0.00 
nr_test_examples 512
total_batch_test 1
TEST epoch 3  loss 757709846097920.00000 acc 52.34 

The magnitude of the embedding values returned by the model increases along with the loss.

For learning rate 0.1:

[[  1.29028062e+22   2.76679972e+22  -1.60350428e+22  -2.59803047e+22
   -7.18799158e+21   3.79426737e+22   6.16485875e+21   5.25694511e+22
    1.88533167e+22   2.83884797e+21   8.02921163e+21  -9.36909501e+21
   -1.44595632e+22  -2.42238243e+22   2.02972577e+21   1.05234577e+22
   -1.80612585e+22  -4.78811634e+22   1.49373501e+22   5.06000855e+22
    3.70631387e+22   1.84049113e+22  -3.99712842e+22   3.87442379e+22
    1.75347753e+22   5.92351884e+22  -3.53815667e+22  -1.82951788e+22
   -6.43566909e+22   2.47560282e+22   5.30715552e+21   1.83587696e+22
   -7.92202990e+21   1.67361902e+22   8.59540559e+20  -3.81585403e+22
   -1.21638398e+22   4.17503997e+22  -1.22125473e+22   2.79304332e+22
   -4.56848209e+22   1.57062125e+22  -2.50028311e+21  -2.62136002e+22
    4.54086438e+21  -1.56374639e+22  -9.88864603e+21  -4.41802088e+22
   -1.34634863e+22   5.70279618e+21   2.03487718e+22  -2.43145786e+22
    3.17775273e+22  -1.20715622e+22   2.58878188e+22   5.10632087e+22
    4.19953009e+22   3.96467818e+22  -1.04965802e+22   3.02379628e+22
   -5.25661860e+22   3.07441015e+21  -5.18819518e+21   2.95340929e+22
    1.14506092e+22   1.15907500e+22   6.69119500e+21   3.77412660e+22
   -3.94501085e+21   1.33659958e+22  -1.60639323e+22   4.13619597e+22
    2.68251817e+21   6.45229424e+21  -2.73042746e+21   4.42164447e+22
    2.80798401e+22  -1.88889266e+22   4.13956748e+21   3.89647612e+21
   -3.97987648e+22   3.42041704e+22  -7.92604683e+20   6.57421467e+22
   -8.36352284e+21  -3.10638036e+22   4.72475508e+21  -1.85049497e+22
   -2.01018620e+22  -4.16415747e+22  -1.26361030e+22   3.21139147e+22
    9.59236321e+21   1.88358765e+22  -1.30287966e+22  -7.88201598e+21
    3.74658596e+22  -1.73451794e+22   3.64240847e+22   3.83275750e+21
    3.18538926e+22  -2.88709469e+22  -3.58837879e+22  -8.98292556e+20
    1.61682176e+22  -4.03502305e+22   1.66714803e+22  -1.75002721e+22
    1.72512196e+22   1.00159954e+22   1.31722408e+22  -6.84561825e+22
    1.55648918e+22   1.01815039e+22   2.80281495e+21   2.46405536e+22
   -3.38236179e+21  -4.50928036e+21  -3.56030898e+22   3.63372148e+22
   -2.91085715e+21   1.96335417e+22  -9.57801362e+21   4.60519886e+21
    2.86536550e+22   3.00846580e+22   8.66609606e+21   8.57120803e+21]]

For learning rate 0.01:

[[ 135379.078125    427807.0625     -211165.5        -270527.875
   263263.46875      61203.9765625   243880.703125   -134595.53125
    65044.28125    -133903.921875   -326986.875      -346536.375       349003.
  -138743.328125    440702.1875     -108623.6484375    73725.84375
  -140035.90625    -357855.75        338021.65625     247224.15625
   -85308.8515625  -511153.90625     206612.296875   -317970.0625
   -95346.1796875   -24617.36523438 -369452.21875    -477215.0625
  -154431.234375    281639.625      -387593.4375       96041.2109375
  -184906.59375     107803.296875     74392.546875    463264.78125
   239308.84375     743635.375       -40640.921875      6956.1953125
   284925.75       -649819.3125     -295953.34375      38507.95703125
    35773.08984375  214856.546875   -289618.78125     381939.90625
   -68496.5546875   418068.46875     627032.625       182973.40625
   119805.296875     14911.890625    475292.40625    -265693.125
  -416467.28125    -354252.125      -162428.90625     336221.15625
    41771.5625     -395673.09375     149899.5         -86771.7421875
   -84667.2890625  -299950.8125      537230.5625     -138381.921875
   294517.21875      92734.6015625    26118.45898438  380978.34375
  -524781.9375     -150150.921875    563931.875       212278.8125
  -156267.859375     -7298.81445312 -546963.125       155122.828125
   -41295.8359375    46307.93359375 -128129.0546875    36079.36328125
  -460227.65625     123968.7734375   728651.4375      252526.984375
  -126041.7734375   265436.          -74924.703125    244991.8125
    38667.71875     -29434.65429688  374994.15625    -146754.859375
   180715.015625     95923.5078125   479208.21875     333908.5
   132672.703125   -402727.09375    -425125.03125     -68114.1640625
   122268.4375     -308014.96875     473961.40625     370820.125
  -502812.3125      201727.015625    156381.46875     337941.125
  -291394.9375      273098.71875     -91102.7421875    64342.390625
  -316238.625       291803.21875    -413403.4375      207456.203125
   106696.90625    -274239.90625     266393.65625      50893.91015625
   149943.265625   -100018.765625   -283917.65625   ]]

For learning rate 0.0001 I get small values.

I'm accumulating gradients (because I don't have enough resources to push large batches through the network at once):

    tvs = tf.trainable_variables() ## Retrieve all trainable variables you defined in your graph

    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = 0.01
    learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, 25600, 0.96, staircase=True)

    opt = tf.train.AdamOptimizer(learning_rate)

    ## Creation of a list of variables with the same shape as the trainable ones
    # initialized with 0s
    accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]                                        
    zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]

    ## Calls the compute_gradients function of the optimizer to obtain... the list of gradients
    gvs = opt.compute_gradients(cost, tvs)

    ## Adds to each element from the list you initialized earlier with zeros its gradient (works because accum_vars and gvs are in the same order)
    accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]

    train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)], global_step=global_step)
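
(For reference, these ops are meant to be run roughly like this; a simplified sketch in which num_steps, accumulation_steps, get_next_minibatch and the placeholders x, y stand in for my real training loop, which is not shown here:)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for step in range(num_steps):
            # reset the gradient accumulators
            sess.run(zero_ops)

            # accumulate gradients over several small batches
            for _ in range(accumulation_steps):
                batch_x, batch_y = get_next_minibatch()
                sess.run(accum_ops, feed_dict={x: batch_x, y: batch_y})

            # apply the summed gradients in a single Adam update
            sess.run(train_step)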

Why does this happen?

Here is my model:

def siamese_convnet(x):


    w_conv1_1 = tf.get_variable(name='w_conv1_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 1, 64])
    w_conv1_2 = tf.get_variable(name='w_conv1_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 64, 64])

    w_conv2_1 = tf.get_variable(name='w_conv2_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 64, 128])
    w_conv2_2 = tf.get_variable(name='w_conv2_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 128, 128])

    w_conv3_1 = tf.get_variable(name='w_conv3_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 128, 256])
    w_conv3_2 = tf.get_variable(name='w_conv3_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 256])
    w_conv3_3 = tf.get_variable(name='w_conv3_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 256])

    w_conv4_1 = tf.get_variable(name='w_conv4_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 256, 512])
    w_conv4_2 = tf.get_variable(name='w_conv4_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv4_3 = tf.get_variable(name='w_conv4_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[1, 1, 512, 512])

    w_conv5_1 = tf.get_variable(name='w_conv5_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv5_2 = tf.get_variable(name='w_conv5_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[3, 3, 512, 512])
    w_conv5_3 = tf.get_variable(name='w_conv5_3', initializer=tf.contrib.layers.xavier_initializer(), shape=[1, 1, 512, 512])

    w_fc_1 = tf.get_variable(name='w_fc_1', initializer=tf.contrib.layers.xavier_initializer(), shape=[5*5*512, 2048])
    w_fc_2 = tf.get_variable(name='w_fc_2', initializer=tf.contrib.layers.xavier_initializer(), shape=[2048, 1024])


    w_out = tf.get_variable(name='w_out', initializer=tf.contrib.layers.xavier_initializer(), shape=[1024, 128])

    bias_conv1_1 = tf.get_variable(name='bias_conv1_1', initializer=tf.constant(0.01, shape=[64]))
    bias_conv1_2 = tf.get_variable(name='bias_conv1_2', initializer=tf.constant(0.01, shape=[64]))

    bias_conv2_1 = tf.get_variable(name='bias_conv2_1', initializer=tf.constant(0.01, shape=[128]))
    bias_conv2_2 = tf.get_variable(name='bias_conv2_2', initializer=tf.constant(0.01, shape=[128]))

    bias_conv3_1 = tf.get_variable(name='bias_conv3_1', initializer=tf.constant(0.01, shape=[256]))
    bias_conv3_2 = tf.get_variable(name='bias_conv3_2', initializer=tf.constant(0.01, shape=[256]))
    bias_conv3_3 = tf.get_variable(name='bias_conv3_3', initializer=tf.constant(0.01, shape=[256]))

    bias_conv4_1 = tf.get_variable(name='bias_conv4_1', initializer=tf.constant(0.01, shape=[512]))
    bias_conv4_2 = tf.get_variable(name='bias_conv4_2', initializer=tf.constant(0.01, shape=[512]))
    bias_conv4_3 = tf.get_variable(name='bias_conv4_3', initializer=tf.constant(0.01, shape=[512]))

    bias_conv5_1 = tf.get_variable(name='bias_conv5_1', initializer=tf.constant(0.01, shape=[512]))
    bias_conv5_2 = tf.get_variable(name='bias_conv5_2', initializer=tf.constant(0.01, shape=[512]))
    bias_conv5_3 = tf.get_variable(name='bias_conv5_3', initializer=tf.constant(0.01, shape=[512]))

    bias_fc_1 = tf.get_variable(name='bias_fc_1', initializer=tf.constant(0.01, shape=[2048]))
    bias_fc_2 = tf.get_variable(name='bias_fc_2', initializer=tf.constant(0.01, shape=[1024]))

    '''bias_fc = tf.get_variable(name='bias_fc', initializer=tf.zeros([1024]))'''
    out = tf.get_variable(name='out', initializer=tf.constant(0.01, shape=[128]))

    x = tf.reshape(x, [-1, 160, 160, 1])

    conv1_1 = tf.nn.relu(conv2d(x, w_conv1_1) + bias_conv1_1)
    conv1_2 = tf.nn.relu(conv2d(conv1_1, w_conv1_2) + bias_conv1_2)

    max_pool1 = max_pool(conv1_2)
    #max_pool1 = tf.nn.dropout(max_pool1, keep_rate)

    conv2_1 = tf.nn.relu(conv2d(max_pool1, w_conv2_1) + bias_conv2_1)
    conv2_2 = tf.nn.relu(conv2d(conv2_1, w_conv2_2) + bias_conv2_2)

    max_pool2 = max_pool(conv2_2)
    #max_pool2 = tf.nn.dropout(max_pool2, keep_rate)

    conv3_1 = tf.nn.relu(conv2d(max_pool2, w_conv3_1) + bias_conv3_1)
    conv3_2 = tf.nn.relu(conv2d(conv3_1, w_conv3_2) + bias_conv3_2)
    conv3_3 = tf.nn.relu(conv2d(conv3_2, w_conv3_3) + bias_conv3_3)

    max_pool3 = max_pool(conv3_3)
    #max_pool3 = tf.nn.dropout(max_pool3, keep_rate)

    conv4_1 = tf.nn.relu(conv2d(max_pool3, w_conv4_1) + bias_conv4_1)
    conv4_2 = tf.nn.relu(conv2d(conv4_1, w_conv4_2) + bias_conv4_2)
    conv4_3 = tf.nn.relu(conv2d(conv4_2, w_conv4_3) + bias_conv4_3)

    max_pool4 = max_pool(conv4_3)
    #max_pool4 = tf.nn.dropout(max_pool4, keep_rate)

    conv5_1 = tf.nn.relu(conv2d(max_pool4, w_conv5_1) + bias_conv5_1)
    conv5_2 = tf.nn.relu(conv2d(conv5_1, w_conv5_2) + bias_conv5_2)
    conv5_3 = tf.nn.relu(conv2d(conv5_2, w_conv5_3) + bias_conv5_3)

    max_pool5 = max_pool(conv5_3)
    #max_pool5 = tf.nn.dropout(max_pool5, keep_rate)

    fc_helper = tf.reshape(max_pool5, [-1, 5*5*512])
    fc_1 = tf.nn.relu(tf.matmul(fc_helper, w_fc_1) + bias_fc_1)
    #fc_1 = tf.nn.dropout(fc_1, keep_rate)

    fc_2 = tf.nn.relu(tf.matmul(fc_1, w_fc_2) + bias_fc_2)
    #fc_2 = tf.nn.dropout(fc_2, 0.7)

    '''fc = tf.nn.relu(tf.matmul(fc_1, fc_layer) + bias_fc)
    fc = tf.nn.dropout(fc, keep_rate)
    output = tf.matmul(fc, w_out) + out'''

    output = tf.matmul(fc_2, w_out) + out
    #output = tf.nn.l2_normalize(output, 0)

    return output

Upvotes: 0

Views: 81

Answers (1)

Stephen

Reputation: 824

Backpropagation computes the gradient of the cost with respect to each weight, and gradient descent uses that gradient to decide how much to change the weights in your model. Those changes are scaled by the learning rate. When the learning rate is large, the changes to the weights are also large; if it is too large, each update overshoots, you end up far from where you were, and the cost will usually be significantly worse. A learning rate that is too small has the opposite problem: the model still improves, but very slowly.

In general, if your costs are blowing up, your learning rate is probably too high.
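
As a toy illustration (plain gradient descent on f(w) = w², not the network above; the descend helper and the rates below are made up for the example), you can see the overshoot directly: for this function any rate above 1.0 diverges, while smaller rates converge.

    # Toy example: gradient descent on f(w) = w**2, whose gradient is 2*w.
    # This is not the questioner's model; it only shows how a too-large
    # learning rate makes each update overshoot so the values blow up.
    def descend(lr, steps=10, w=1.0):
        for _ in range(steps):
            w = w - lr * 2 * w   # weight update: w <- w - lr * df/dw
        return w

    print(descend(0.0001))  # ~0.998  -> creeping toward the minimum at 0
    print(descend(0.1))     # ~0.107  -> converging
    print(descend(1.5))     # 1024.0  -> diverging: each step overshoots and doubles |w|

For a deep network the safe range is far smaller and depends on the loss surface, which is why 0.1 or 0.01 can already blow up while 0.0001 behaves.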

Upvotes: 2
