Reputation: 28022
I was going through an online tutorial on momentum-based learning and came across this method in Theano:
import theano
import theano.tensor as T

def gradient_updates_momentum(cost, params, learning_rate, momentum):
    '''
    Compute updates for gradient descent with momentum
    :parameters:
        - cost : theano.tensor.var.TensorVariable
            Theano cost function to minimize
        - params : list of theano.tensor.var.TensorVariable
            Parameters to compute gradient against
        - learning_rate : float
            Gradient descent learning rate
        - momentum : float
            Momentum parameter, should be at least 0 (standard gradient descent) and less than 1
    :returns:
        updates : list
            List of updates, one for each parameter
    '''
    # Make sure momentum is a sane value
    assert momentum < 1 and momentum >= 0
    # List of update steps for each parameter
    updates = []
    # Just gradient descent on cost
    for param in params:
        # For each parameter, we'll create a param_update shared variable.
        # This variable will keep track of the parameter's update step across iterations.
        # We initialize it to 0
        param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
        # Each parameter is updated by taking a step in the direction of the gradient.
        # However, we also "mix in" the previous step according to the given momentum value.
        # Note that when updating param_update, we are using its old value and also the new gradient step.
        updates.append((param, param - learning_rate*param_update))
        # Note that we don't need to derive backpropagation to compute updates - just use T.grad!
        updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
    return updates
Shouldn't the order of the following two lines be the other way round (interchanged)?
updates.append((param, param - learning_rate*param_update))
and
updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
I understand that the updates are applied only after the train function is executed and the cost is calculated, correct?
Doesn't that mean we should use the current cost and the existing param_update value (which comes from the previous iteration) to calculate the new param_update, and then use that new value to update the current param?
Why is it the other way round, and why is that correct?
Upvotes: 1
Views: 182
Reputation: 34187
The order of the updates inside the updates list provided to theano.function
is ignored. Updates are always computed using the old values of the shared variables.
This snippet of code shows that the order of updates is ignored:
import theano

p = 0.5
param = theano.shared(1.)
param_update = theano.shared(2.)
cost = 3 * param * param

# The same two updates, passed to theano.function in both possible orders
update_a = (param, param - param_update)
update_b = (param_update, p * param_update + (1 - p) * theano.grad(cost, param))
updates1 = [update_a, update_b]
updates2 = [update_b, update_a]

f1 = theano.function([], outputs=[param, param_update], updates=updates1)
f2 = theano.function([], outputs=[param, param_update], updates=updates2)

print(f1(), f1())

# Reset the shared variables so f2 starts from the same state as f1 did
param.set_value(1)
param_update.set_value(2)
print(f2(), f2())
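f1 and f2 should print the same pairs of values: in both cases the gradient and the two updates are evaluated from the old values of param and param_update, so the order in which the updates are listed makes no difference.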
If, logically, you want
new_a = old_a + a_update
new_b = new_a + b_update
Then you need to provide updates like this:
new_a = old_a + a_update
new_b = old_a + a_update + b_update
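For example, here is a minimal sketch of that pattern applied to the question's momentum code (the function name and this variant are my own illustration, not the tutorial's; it assumes import theano and import theano.tensor as T as above). The new param_update expression is built once and reused in both updates, instead of relying on update order:
def gradient_updates_momentum_new_step(cost, params, learning_rate, momentum):
    # Variant of the question's function in which param is moved using the
    # *new* value of param_update, by repeating that expression explicitly.
    updates = []
    for param in params:
        param_update = theano.shared(param.get_value() * 0.,
                                     broadcastable=param.broadcastable)
        # Symbolic expression for the new step, built from the old param_update
        new_step = momentum * param_update + (1. - momentum) * T.grad(cost, param)
        updates.append((param_update, new_step))
        # Reusing the same expression here moves param by the new step, even
        # though Theano still evaluates everything from the old shared values.
        updates.append((param, param - learning_rate * new_step))
    return updates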
Upvotes: 2