user3684792

Reputation: 2611

Tensorflow AdamOptimizer vs Gradient Descent

I'm loosely following this tutorial to get a feel for simple TensorFlow calculations. For those not wanting to click the link, it is a simple OLS problem of fitting y = Wx + b, with true solution y = 2x.

I have the following code and output:

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 1])  # 1d input vector
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

y = tf.matmul(x, W) + b                    # model: y = Wx + b

y_res = tf.placeholder(tf.float32, [None, 1])

cost = tf.reduce_sum(tf.pow(y - y_res, 2))  # sum of squared errors

x_l = np.array([[i] for i in range(100)])
y_l = 2 * x_l                              # true solution: W = 2, b = 0

train = tf.train.GradientDescentOptimizer(0.000001).minimize(cost)

init = tf.initialize_all_variables()       # deprecated alias for tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        feed = {x: x_l, y_res: y_l}
        sess.run(train, feed_dict=feed)

        print ("iteration", i)
        print ("W", sess.run(W))
        print ("B", sess.run(b))

for which I get a reasonable answer:

('iteration', 0)
('W', array([[ 1.31340003]], dtype=float32))
('B', array([ 0.0198], dtype=float32))
('iteration', 1)
('W', array([[ 1.76409423]], dtype=float32))
('B', array([ 0.02659338], dtype=float32))
('iteration', 2)
('W', array([[ 1.91875029]], dtype=float32))
('B', array([ 0.02892353], dtype=float32))
('iteration', 3)
('W', array([[ 1.97182059]], dtype=float32))
('B', array([ 0.02972212], dtype=float32))
('iteration', 4)
('W', array([[ 1.99003172]], dtype=float32))
('B', array([ 0.02999515], dtype=float32))

However, I have been looking to take things further and understand some of the other optimizers implemented, specifically Adam.

To look at the effect of this optimizer, I changed the relevant line to

train = tf.train.AdamOptimizer().minimize(cost)

This gives slightly strange results:

('iteration', 0)
('W', array([[ 0.001]], dtype=float32))
('B', array([ 0.001], dtype=float32))
('iteration', 1)
('W', array([[ 0.00199998]], dtype=float32))
('B', array([ 0.00199998], dtype=float32))
('iteration', 2)
('W', array([[ 0.00299994]], dtype=float32))
('B', array([ 0.00299994], dtype=float32))
('iteration', 3)
('W', array([[ 0.00399987]], dtype=float32))
('B', array([ 0.00399987], dtype=float32))
('iteration', 4)
('W', array([[ 0.00499976]], dtype=float32))
('B', array([ 0.00499976], dtype=float32))

Now, I have messed around with the learning rate here and so on, but I am somewhat baffled as to why this is having such a hard time converging. Does anyone have any intuition as to why this optimizer is struggling on such a trivial problem?
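
By "messed around with the learning rate" I mean passing it to the optimizer explicitly; tf.train.AdamOptimizer defaults to learning_rate=0.001, and the 0.01 below is just an arbitrary example, not a value I am recommending:

# illustration only: Adam with an explicit learning rate (the default is 0.001)
train = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)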

Upvotes: 1

Views: 4998

Answers (2)

P-Gn

Reputation: 24651

This optimizer, like most of the others provided in TensorFlow, aims to improve on gradient descent for stochastic optimization. In one way or another, these optimizers slowly build up state (momentum, moment estimates, ...) in order to eventually outperform basic gradient descent.

Your experiment is not stochastic, and simple enough to converge rapidly with gradient descent. Both are unfavorable conditions for more elaborate optimizers to shine.
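
To make the "moments" part concrete, here is a minimal NumPy sketch of the Adam update rule from Kingma & Ba, using the same default hyperparameters as tf.train.AdamOptimizer. It follows the published formula rather than TensorFlow's internal implementation, and it holds the gradient fixed purely for illustration:

import numpy as np

# Adam update (Kingma & Ba, 2015) with tf.train.AdamOptimizer's defaults
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

m = v = 0.0        # first and second moment estimates
g = -1313400.0     # d(cost)/dW at W = b = 0 for the data in the question
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)    # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(t, update)                # roughly -0.001 each step, whatever |g| is

Because the step is normalized by the second-moment estimate, its magnitude is capped at roughly the learning rate (0.001 by default). That is exactly why W and b in your output creep up by about 0.001 per iteration, while gradient descent, even with its tiny 1e-6 learning rate, takes large steps because the raw gradient is on the order of 10^6.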

Upvotes: 2

J63

Reputation: 843

I would not say that Adam is having a hard time converging, nor that it is failing; it is just taking its time:

iteration 14499
W [[ 1.9996556]]
B [ 0.02274081]

The abstract of the paper you linked describes the kinds of problems Adam is best suited for, and this is not one of them. Try SGD on one of those problems and you will see a real failure.
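
If you do not want to wait the ~14500 iterations shown above, you can also just hand Adam a larger learning rate on this deterministic problem. A rough sketch reusing the graph from the question (the 0.1 learning rate and 2000 iterations are arbitrary illustrative values, not tuned ones):

# sketch: same graph as in the question, but with a larger Adam learning rate
train = tf.train.AdamOptimizer(learning_rate=0.1).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2000):
        sess.run(train, feed_dict={x: x_l, y_res: y_l})
    print("W", sess.run(W))
    print("B", sess.run(b))

Since Adam's step size is bounded by roughly the learning rate, W simply cannot travel from 0 to 2 in a handful of 0.001-sized steps; it needs either many more iterations or a bigger learning rate.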

Upvotes: 1
