Reputation: 28022
I was going through an online tutorial on momentum-based learning and came across this method in Theano:
import theano
import theano.tensor as T

def gradient_updates_momentum(cost, params, learning_rate, momentum):
    '''
    Compute updates for gradient descent with momentum
    :parameters:
        - cost : theano.tensor.var.TensorVariable
            Theano cost function to minimize
        - params : list of theano.tensor.var.TensorVariable
            Parameters to compute gradient against
        - learning_rate : float
            Gradient descent learning rate
        - momentum : float
            Momentum parameter, should be at least 0 (standard gradient descent) and less than 1
    :returns:
        updates : list
            List of updates, one for each parameter
    '''
    # Make sure momentum is a sane value
    assert momentum < 1 and momentum >= 0
    # List of update steps for each parameter
    updates = []
    # Just gradient descent on cost
    for param in params:
        # For each parameter, we'll create a param_update shared variable.
        # This variable will keep track of the parameter's update step across iterations.
        # We initialize it to 0
        param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
        # Each parameter is updated by taking a step in the direction of the gradient.
        # However, we also "mix in" the previous step according to the given momentum value.
        # Note that when updating param_update, we are using its old value and also the new gradient step.
        updates.append((param, param - learning_rate*param_update))
        # Note that we don't need to derive backpropagation to compute updates - just use T.grad!
        updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
    return updates
Shouldn't the order of the following two lines be the other way round (interchanged)?
updates.append((param, param - learning_rate*param_update))
and
updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
I understand that the updates are applied only after the train function is executed and the cost is calculated, correct?
Doesn't that mean we should use the current cost and the existing param_update value (which comes from the previous iteration) to calculate the new param_update, and then use that new value to update the current param?
Why is it the other way round, and why is that correct?
Upvotes: 1
Views: 182
Reputation: 34187
The order of the updates inside the updates list provided to theano.function
is ignored. Updates are always computed using the old values of the shared variables.
This snippet of code shows that the order of updates is ignored:
import theano

p = 0.5
param = theano.shared(1.)
param_update = theano.shared(2.)
cost = 3 * param * param

# The same two updates, passed to theano.function in both possible orders
update_a = (param, param - param_update)
update_b = (param_update, p * param_update + (1 - p) * theano.grad(cost, param))
updates1 = [update_a, update_b]
updates2 = [update_b, update_a]

f1 = theano.function([], outputs=[param, param_update], updates=updates1)
f2 = theano.function([], outputs=[param, param_update], updates=updates2)

print(f1(), f1())

# Reset the shared variables so f2 starts from the same state as f1 did
param.set_value(1)
param_update.set_value(2)
print(f2(), f2())
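f1 and f2 should print the same pairs of values: in both cases the gradient and the two updates are evaluated from the old values of param and param_update, so the order in which the updates are listed makes no difference.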
If, logically, you want
new_a = old_a + a_update
new_b = new_a + b_update
Then you need to provide updates like this:
new_a = old_a + a_update
new_b = old_a + a_update + b_update
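For example, here is a minimal sketch of that pattern applied to the question's momentum code (the function name and this variant are my own illustration, not the tutorial's; it assumes import theano and import theano.tensor as T as above). The new param_update expression is built once and reused in both updates, instead of relying on update order:
def gradient_updates_momentum_new_step(cost, params, learning_rate, momentum):
    # Variant of the question's function in which param is moved using the
    # *new* value of param_update, by repeating that expression explicitly.
    updates = []
    for param in params:
        param_update = theano.shared(param.get_value() * 0.,
                                     broadcastable=param.broadcastable)
        # Symbolic expression for the new step, built from the old param_update
        new_step = momentum * param_update + (1. - momentum) * T.grad(cost, param)
        updates.append((param_update, new_step))
        # Reusing the same expression here moves param by the new step, even
        # though Theano still evaluates everything from the old shared values.
        updates.append((param, param - learning_rate * new_step))
    return updates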
Upvotes: 2