M.Y. Babt

Reputation: 2891

How to define weight decay for individual layers in TensorFlow?

In CUDA ConvNet, we can write something like this (source) for each layer:

[conv32]
epsW=0.001
epsB=0.002
momW=0.9
momB=0.9
wc=0

where wc sets the L2 weight decay (weight cost) for that layer.

How can the same be achieved in TensorFlow?

Upvotes: 12

Views: 11943

Answers (3)

LucasB

Reputation: 3533

Both current answers are wrong in that they do not give you "weight decay as in cuda-convnet" but instead L2-regularization, which is different.

When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.

Weight decay (I don't know how to TeX here, so excuse my pseudo-notation):

w[t+1] = w[t] - learning_rate * dw - weight_decay * w[t]

L2-regularization:

loss = actual_loss + lambda * 1/2 sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w, so inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw

gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)
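
To make the difference concrete, here is a small plain-Python sketch of my own (not from the paper) that performs one SGD-with-momentum step both ways, using the usual heavy-ball update v = mu * v + grad:

# Toy single-weight example: same gradient and hyper-parameters, but the two
# schemes treat the momentum buffer differently.
w, v = 1.0, 0.5            # weight and momentum buffer
g = 0.2                    # gradient of the actual loss w.r.t. w
lr, mu, lam = 0.1, 0.9, 0.01

# L2-regularization: lambda * w is folded into the gradient,
# so it also ends up inside the momentum buffer.
v_l2 = mu * v + (g + lam * w)
w_l2 = w - lr * v_l2

# Decoupled weight decay: the decay acts directly on the weight
# and never enters the momentum buffer.
v_wd = mu * v + g
w_wd = w - lr * v_wd - lam * w

print(w_l2, w_wd)          # ~0.934 vs ~0.925 -- the two updates differ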

That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.

One possible way to implement it is to write an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude work-arounds, though. My current code:

import tensorflow as tf
from tensorflow.contrib import layers
from tensorflow.contrib.framework import arg_scope

# In the network definition: register an L2 regularizer for every layer.
with arg_scope([layers.conv2d, layers.fully_connected],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(
            tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

This makes some use of TensorFlow's provided bookkeeping: the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key, which I then sum up and optimize with plain SGD, which, as shown above, corresponds to actual weight decay.

Hope that helps, and if anyone comes up with a nicer code snippet for this, or TensorFlow implements it better (e.g. in the optimizers), please share.

Edit: see also this PR which just got merged into TF.
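
If that PR is the decoupled weight-decay one, the resulting contrib API can be used roughly like this; treat the exact class names and arguments as an assumption and check your TF version:

import tensorflow as tf

# Assumed API: the decoupled weight-decay optimizers in tf.contrib.opt
# (available in newer 1.x releases). loss / global_step as defined above.
optimizer = tf.contrib.opt.MomentumWOptimizer(
    weight_decay=1e-4, learning_rate=0.01, momentum=0.9)
train_op = optimizer.minimize(loss, global_step=global_step)

# The same mechanism can wrap other optimizers, e.g.:
# AdamW = tf.contrib.opt.extend_with_decoupled_weight_decay(tf.train.AdamOptimizer)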

Upvotes: 2

plustar

Reputation: 91

tf.get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=True,
    collections=None,
    caching_device=None,
    partitioner=None,
    validate_shape=True,
    use_resource=None,
    custom_getter=None
)

This is the signature of the TensorFlow function tf.get_variable. You can easily specify a regularizer here to apply weight decay.

Here is an example:

# Your weight decay rate; must be a scalar tensor.
weight_decay = tf.constant(0.0005, dtype=tf.float32)
W = tf.get_variable(name='weight', shape=[4, 4, 256, 512],
                    regularizer=tf.contrib.layers.l2_regularizer(weight_decay))
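
Note that the regularizer only registers its L2 term in the REGULARIZATION_LOSSES collection; you still have to add those terms to your training loss yourself. A minimal sketch, assuming a base loss tensor named loss:

# Collect the terms created by the regularizers and add them to the base loss.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
total_loss = loss + tf.add_n(reg_losses)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)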

Upvotes: 9

Clash

Reputation: 5025

You can add all the variables you want to apply weight decay to into a collection, e.g. 'weights', and then calculate the L2-norm weight decay for the whole collection.

  # Create your variables; put them in a custom 'weights' collection
  # (and keep them in GLOBAL_VARIABLES so they still get initialized).
  weights = tf.get_variable(
      'weights', shape=[4, 4, 256, 512],  # example shape
      collections=['weights', tf.GraphKeys.GLOBAL_VARIABLES])

  with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

  # Add the weight decay loss to another collection called 'losses'
  tf.add_to_collection('losses', weights_norm)

  # Add the other loss components to the collection 'losses'
  # ...

  # To calculate your total loss
  total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')
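
To actually train with this, you then minimize the summed tensor. A short sketch (the optimizer and learning rate are just example choices):

  # Minimize the combined loss; optimizer and learning rate are placeholders.
  optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
  train_op = optimizer.minimize(total_loss)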

Upvotes: 16
