user3684792

Reputation: 2611

Tensorflow AdamOptimizer vs Gradient Descent

I'm loosely following this tutorial to get a feel for simple TensorFlow calculations. For those not wanting to click the link, it is a simple OLS problem of fitting y = Wx + b, with true solution y = 2x.

I have the following code and output:

import tensorflow as tf
import numpy as np

tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 1])  # 1d input vector
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

y = tf.matmul(x, W) + b                    # model: y = Wx + b

y_res = tf.placeholder(tf.float32, [None, 1])

cost = tf.reduce_sum(tf.pow(y - y_res, 2))  # sum of squared errors

x_l = np.array([[i] for i in range(100)])
y_l = 2 * x_l                              # true solution: W = 2, b = 0

train = tf.train.GradientDescentOptimizer(0.000001).minimize(cost)

init = tf.initialize_all_variables()       # deprecated alias for tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        feed = {x: x_l, y_res: y_l}
        sess.run(train, feed_dict=feed)

        print ("iteration", i)
        print ("W", sess.run(W))
        print ("B", sess.run(b))

for which I get a reasonable answer:

('iteration', 0)
('W', array([[ 1.31340003]], dtype=float32))
('B', array([ 0.0198], dtype=float32))
('iteration', 1)
('W', array([[ 1.76409423]], dtype=float32))
('B', array([ 0.02659338], dtype=float32))
('iteration', 2)
('W', array([[ 1.91875029]], dtype=float32))
('B', array([ 0.02892353], dtype=float32))
('iteration', 3)
('W', array([[ 1.97182059]], dtype=float32))
('B', array([ 0.02972212], dtype=float32))
('iteration', 4)
('W', array([[ 1.99003172]], dtype=float32))
('B', array([ 0.02999515], dtype=float32))

However, I have been looking to take things further and understand some of the other optimizers implemented, specifically Adam.

To look at the effect of this optimizer, I changed the relevant line to

train = tf.train.AdamOptimizer().minimize(cost)

This gives slightly strange results:

('iteration', 0)
('W', array([[ 0.001]], dtype=float32))
('B', array([ 0.001], dtype=float32))
('iteration', 1)
('W', array([[ 0.00199998]], dtype=float32))
('B', array([ 0.00199998], dtype=float32))
('iteration', 2)
('W', array([[ 0.00299994]], dtype=float32))
('B', array([ 0.00299994], dtype=float32))
('iteration', 3)
('W', array([[ 0.00399987]], dtype=float32))
('B', array([ 0.00399987], dtype=float32))
('iteration', 4)
('W', array([[ 0.00499976]], dtype=float32))
('B', array([ 0.00499976], dtype=float32))

Now, I have messed around with the learning rate here and so on, but I am somewhat baffled as to why this is having such a hard time converging. Does anyone have any intuition as to why this optimizer is struggling on such a trivial problem?
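
By "messed around with the learning rate" I mean passing it to the optimizer explicitly; tf.train.AdamOptimizer defaults to learning_rate=0.001, and the 0.01 below is just an arbitrary example, not a value I am recommending:

# illustration only: Adam with an explicit learning rate (the default is 0.001)
train = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)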

Upvotes: 1

Views: 4998

Answers (2)

P-Gn

Reputation: 24651

This optimizer, like most of the others provided in TensorFlow, aims to improve on gradient descent for stochastic optimization. In one way or another, these optimizers slowly build up state (momentum, moment estimates, ...) in order to eventually outperform basic gradient descent.

Your experiment is not stochastic, and simple enough to converge rapidly with gradient descent. Both are unfavorable conditions for more elaborate optimizers to shine.
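
To make the "moments" part concrete, here is a minimal NumPy sketch of the Adam update rule from Kingma & Ba, using the same default hyperparameters as tf.train.AdamOptimizer. It follows the published formula rather than TensorFlow's internal implementation, and it holds the gradient fixed purely for illustration:

import numpy as np

# Adam update (Kingma & Ba, 2015) with tf.train.AdamOptimizer's defaults
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

m = v = 0.0        # first and second moment estimates
g = -1313400.0     # d(cost)/dW at W = b = 0 for the data in the question
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)    # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(t, update)                # roughly -0.001 each step, whatever |g| is

Because the step is normalized by the second-moment estimate, its magnitude is capped at roughly the learning rate (0.001 by default). That is exactly why W and b in your output creep up by about 0.001 per iteration, while gradient descent, even with its tiny 1e-6 learning rate, takes large steps because the raw gradient is on the order of 10^6.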

Upvotes: 2

J63

Reputation: 843

I would not say that Adam is having a hard time converging, nor that it is failing; it is just taking its time:

iteration 14499
W [[ 1.9996556]]
B [ 0.02274081]

The abstract of the paper you linked describes the kinds of problems Adam is best suited for, and this is not one of them. Try SGD on one of those problems and you will see a real failure.
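
If you do not want to wait the ~14500 iterations shown above, you can also just hand Adam a larger learning rate on this deterministic problem. A rough sketch reusing the graph from the question (the 0.1 learning rate and 2000 iterations are arbitrary illustrative values, not tuned ones):

# sketch: same graph as in the question, but with a larger Adam learning rate
train = tf.train.AdamOptimizer(learning_rate=0.1).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2000):
        sess.run(train, feed_dict={x: x_l, y_res: y_l})
    print("W", sess.run(W))
    print("B", sess.run(b))

Since Adam's step size is bounded by roughly the learning rate, W simply cannot travel from 0 to 2 in a handful of 0.001-sized steps; it needs either many more iterations or a bigger learning rate.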

Upvotes: 1
