Maybe

Reputation: 2279

Adam in TensorFlow: where do the moment estimates happen?

I know that optimizers in TensorFlow split minimize into compute_gradients and apply_gradients. However, optimization algorithms like Adam generally process the gradients with momentum and some other techniques, as the following figure suggests (thanks @kmario23 for providing the figure).

[figure: the Adam algorithm, including the first and second moment estimates]

I wonder when these techniques are applied to the gradients. Are they applied in compute_gradients or in apply_gradients?

Update

import tensorflow as tf

sess = tf.Session()
x = tf.placeholder(tf.float32, [None, 1])
y = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(tf.ones_like(y), y)
opt = tf.train.AdamOptimizer()
grads = opt.compute_gradients(loss)  # list of (gradient, variable) pairs
sess.run(tf.global_variables_initializer())

# Evaluate the gradients twice with the same input:
print(sess.run(grads, feed_dict={x: [[1]]}))
print(sess.run(grads, feed_dict={x: [[1]]}))

The above code outputs the same result twice. Does that suggest that the moment estimates are computed in apply_gradients? Because, IMHO, if the moment estimates were computed in compute_gradients, then after the first print statement the first and second moments would already have been updated, which should produce a different result in the second print statement.
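One direct way to check this (a minimal sketch continuing the snippet above; 'm' and 'v' are the slot names the TF1 AdamOptimizer uses for its moment estimates):

train_op = opt.apply_gradients(grads)       # creates the moment slot variables
sess.run(tf.global_variables_initializer())

var = tf.trainable_variables()[0]           # e.g. the dense layer's kernel
m = opt.get_slot(var, 'm')                  # first moment estimate
v = opt.get_slot(var, 'v')                  # second moment estimate

print(sess.run([m, v]))                     # all zeros after initialization
sess.run(grads, feed_dict={x: [[1]]})       # compute_gradients alone...
print(sess.run([m, v]))                     # ...leaves the moments untouched
sess.run(train_op, feed_dict={x: [[1]]})    # apply_gradients...
print(sess.run([m, v]))                     # ...is what updates them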

Upvotes: 1

Views: 229

Answers (2)

Maybe

Reputation: 2279

compute_gradients computes only the gradients; all the additional operations specific to a given optimization algorithm are done in apply_gradients. The code in the update is one piece of evidence; another is the following figure cropped from TensorBoard, where the gradients node corresponds to compute_gradients and the Adam node corresponds to apply_gradients.

[figure: TensorBoard graph with separate gradients and Adam nodes]
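This split can also be checked programmatically (a minimal sketch, not from the original answer): the ops that apply_gradients adds to the graph all carry the optimizer's Adam name scope.

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, [None, 1])
    y = tf.layers.dense(x, 1)
    loss = tf.losses.mean_squared_error(tf.ones_like(y), y)
    opt = tf.train.AdamOptimizer()
    grads = opt.compute_gradients(loss)   # adds ops under the gradients/ scope
    before = {op.name for op in g.get_operations()}
    opt.apply_gradients(grads)            # adds ops under the Adam/ scope
    added = [op.name for op in g.get_operations() if op.name not in before]
    print([name for name in added if 'Adam' in name])  # moment updates live here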

Upvotes: 1

kmario23

Reputation: 61455

Below is the Adam algorithm as presented in the Deep Learning book. As for your question, the important thing to note here is the parameter update Δθ (written with a capital delta) computed in the second-to-last step: it is built from the bias-corrected moment estimates, not from the raw gradient alone.

[figure: the Adam algorithm from the Deep Learning book]
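For concreteness, here is a sketch of one such update in NumPy (a hypothetical helper, not from the answer; eps, rho1, rho2, and delta follow the book's notation):

import numpy as np

def adam_step(theta, grad, s, r, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    # One Adam update following the book's pseudocode (s, r: running moments).
    t = t + 1
    s = rho1 * s + (1 - rho1) * grad             # biased first moment estimate
    r = rho2 * r + (1 - rho2) * grad * grad      # biased second moment estimate
    s_hat = s / (1 - rho1 ** t)                  # correct first-moment bias
    r_hat = r / (1 - rho2 ** t)                  # correct second-moment bias
    delta_theta = -eps * s_hat / (np.sqrt(r_hat) + delta)  # the second-to-last step
    return theta + delta_theta, s, r, t          # apply the update to theta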

As for how TensorFlow computes this, the optimization (i.e. minimize) is a two-step process.

In the first step, compute_gradients gathers all the necessary ingredients for the final update. The second step, apply_gradients, then applies the update to the parameters based on the gradients computed in the first step and the learning rate (lr).
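In code, the two steps look like this (a minimal sketch; in TF1, minimize is essentially these two calls chained):

# opt.minimize(loss) is equivalent to:
grads_and_vars = opt.compute_gradients(loss)    # step 1: compute the raw gradients
train_op = opt.apply_gradients(grads_and_vars)  # step 2: update the parameters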

Upvotes: 2
