Reputation: 2279
I know that optimizers in TensorFlow divide minimize into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients with momentum and some other techniques, as the following figure suggests (thanks @kmario23 for providing the figure). I wonder when these techniques are applied to the gradients: are they applied in compute_gradients or in apply_gradients?
import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)
opt = tf.train.AdamOptimizer()
grads = opt.compute_gradients(loss)  # list of (gradient, variable) pairs
sess.run(tf.global_variables_initializer())
# run only the gradient computation, twice, without applying any update
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))
The above code outputs the same result twice. Does this suggest that the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would already be updated, which should lead to a different result in the second print statement.
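To check this directly, one can also watch Adam's slot variables, which hold the first and second moment estimates. A minimal sketch (assuming the TF1 Optimizer.get_slot API and Adam's 'm'/'v' slot names):

import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer()
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(grads_and_vars)   # this call creates the 'm' and 'v' slots

var = tf.trainable_variables()[0]                # e.g. the dense layer's kernel
m, v = opt.get_slot(var, 'm'), opt.get_slot(var, 'v')

sess.run(tf.global_variables_initializer())
print(sess.run([m, v]))                          # zeros after initialization
sess.run(grads_and_vars, feed_dict={x: [[1]]})   # gradient computation only
print(sess.run([m, v]))                          # still zeros
sess.run(train_op, feed_dict={x: [[1]]})         # full update step
print(sess.run([m, v]))                          # moments have changed

If the moments change only after train_op runs, that would confirm they are maintained by apply_gradients.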
Upvotes: 1
Views: 229
Reputation: 2279
compute_gradients computes only the gradients; all the additional operations corresponding to a specific optimization algorithm are done in apply_gradients. The code in the update is one piece of evidence; another is the following figure cropped from TensorBoard, where the Adam node corresponds to the apply_gradients operation.
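This can also be seen without TensorBoard by listing the graph's variables before and after each call. A minimal sketch (assuming the usual TF1 variable naming for Adam's slots and beta power accumulators):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer()
grads_and_vars = opt.compute_gradients(loss)
# only the layer's own variables exist so far (e.g. dense/kernel, dense/bias)
print([v.name for v in tf.global_variables()])

train_op = opt.apply_gradients(grads_and_vars)
# now the Adam-specific state appears as well, typically named like
# dense/kernel/Adam, dense/kernel/Adam_1, beta1_power, beta2_power
print([v.name for v in tf.global_variables()])

compute_gradients only adds gradient ops to the graph; the optimizer's own state is created and updated by apply_gradients.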
Upvotes: 1
Reputation: 61455
Below is the Adam algorithm as presented in the Deep Learning book. As for your question, the important thing to note here is the computed update Δθ (delta theta) in the second-to-last step.
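Since the figure itself is not reproduced here, the steps of that algorithm can be written out as follows (reconstructed from the book's notation: ρ1, ρ2 are the moment decay rates, ε the step size, δ a small constant for numerical stability, t the time step):

\begin{align*}
g &\leftarrow \tfrac{1}{m}\,\nabla_\theta \textstyle\sum_i L\big(f(x^{(i)};\theta),\, y^{(i)}\big) && \text{(minibatch gradient)}\\
s &\leftarrow \rho_1 s + (1-\rho_1)\, g && \text{(biased first moment)}\\
r &\leftarrow \rho_2 r + (1-\rho_2)\, g \odot g && \text{(biased second moment)}\\
\hat{s} &\leftarrow \frac{s}{1-\rho_1^{\,t}}, \qquad \hat{r} \leftarrow \frac{r}{1-\rho_2^{\,t}} && \text{(bias correction)}\\
\Delta\theta &\leftarrow -\,\epsilon\,\frac{\hat{s}}{\sqrt{\hat{r}}+\delta} && \text{(the second-to-last step)}\\
\theta &\leftarrow \theta + \Delta\theta && \text{(apply update)}
\end{align*}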
As for how TensorFlow computes this, it is a two-step process in the optimization (i.e. minimization). In the first step, all the necessary ingredients for the final gradients are computed; the second step then just applies the update to the parameters, based on the gradients computed in the first step and the learning rate (lr), as in the sketch below.
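In TF1 terms, the two steps map onto the optimizer API like this (a minimal sketch; minimize is just the two calls chained together):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)

opt = tf.train.AdamOptimizer(learning_rate=0.001)  # the lr mentioned above

# Step 1: the raw gradients d(loss)/d(theta).
grads_and_vars = opt.compute_gradients(loss)

# Step 2: Adam's moment bookkeeping plus the actual parameter update,
# scaled by the learning rate.
train_op = opt.apply_gradients(grads_and_vars)

# Equivalent single call:
# train_op = opt.minimize(loss)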
Upvotes: 2