Reputation: 2611
I'm loosely following this tutorial to get a feel for simple TensorFlow calculations. For those not wanting to click the link, it is a simple OLS problem of fitting y = Wx + b, with the true solution y = 2x.
I have the following code and output:
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

x = tf.placeholder(tf.float32, [None, 1])   # 1d input vector
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b                     # model: y = Wx + b
y_res = tf.placeholder(tf.float32, [None, 1])
cost = tf.reduce_sum(tf.pow(y - y_res, 2))  # sum of squared errors

x_l = np.array([[i] for i in range(100)])
y_l = 2 * x_l                               # true relation: y = 2x

train = tf.train.GradientDescentOptimizer(0.000001).minimize(cost)
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5):
        feed = {x: x_l, y_res: y_l}
        sess.run(train, feed_dict=feed)
        print ("iteration", i)
        print ("W", sess.run(W))
        print ("B", sess.run(b))
for which I get the reasonable answer:
('iteration', 0)
('W', array([[ 1.31340003]], dtype=float32))
('B', array([ 0.0198], dtype=float32))
('iteration', 1)
('W', array([[ 1.76409423]], dtype=float32))
('B', array([ 0.02659338], dtype=float32))
('iteration', 2)
('W', array([[ 1.91875029]], dtype=float32))
('B', array([ 0.02892353], dtype=float32))
('iteration', 3)
('W', array([[ 1.97182059]], dtype=float32))
('B', array([ 0.02972212], dtype=float32))
('iteration', 4)
('W', array([[ 1.99003172]], dtype=float32))
('B', array([ 0.02999515], dtype=float32))
However, I have been looking to take things further and understand some of the other optimizers implemented, specifically Adam.
To look at the effect of this optimizer, I changed the relevant line to
train = tf.train.AdamOptimizer().minimize(cost)
which gives these slightly strange results:
('iteration', 0)
('W', array([[ 0.001]], dtype=float32))
('B', array([ 0.001], dtype=float32))
('iteration', 1)
('W', array([[ 0.00199998]], dtype=float32))
('B', array([ 0.00199998], dtype=float32))
('iteration', 2)
('W', array([[ 0.00299994]], dtype=float32))
('B', array([ 0.00299994], dtype=float32))
('iteration', 3)
('W', array([[ 0.00399987]], dtype=float32))
('B', array([ 0.00399987], dtype=float32))
('iteration', 4)
('W', array([[ 0.00499976]], dtype=float32))
('B', array([ 0.00499976], dtype=float32))
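For reference, as far as I can tell the bare call above just uses the TF 1.x defaults, i.e. it should be equivalent to:
# Same as AdamOptimizer() with the documented defaults spelled out
# (note the 0.001 learning rate).
train = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9,
                               beta2=0.999, epsilon=1e-08).minimize(cost)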
Now, I have messed around with the learning rate here and so on, but I am somewhat baffled as to why this is having such a hard time converging. Does anyone have any intuition as to why this optimizer struggles on such a trivial problem?
Upvotes: 1
Views: 4998
Reputation: 24651
This optimizer, like most of the others provided in tf, aims to improve on gradient descent for stochastic optimization. In one way or another, these optimizers slowly build up knowledge (momentum, moments, ...) to eventually outperform basic gradient descent.
Your experiment is not stochastic, and it is simple enough to converge rapidly with plain gradient descent. Both are unfavorable conditions for the more elaborate optimizers to shine.
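To make this concrete, here is a rough NumPy sketch of the Adam update rule (not TensorFlow's actual implementation), using the same defaults as tf.train.AdamOptimizer. The normalized step is roughly bounded by the learning rate (0.001 by default) no matter how large the raw gradient is, which matches the roughly 0.001-per-iteration change in W and b in your output.
import numpy as np

# Minimal sketch of the Adam update with tf.train.AdamOptimizer's defaults:
# lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8. grad_fn is any function
# returning the gradient of the loss at theta.
def adam_steps(grad_fn, theta, n_steps, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(theta)   # running mean of gradients (1st moment)
    v = np.zeros_like(theta)   # running mean of squared gradients (2nd moment)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)   # bias correction
        v_hat = v / (1 - b2 ** t)
        # The normalized step lr * m_hat / sqrt(v_hat) has magnitude of
        # roughly lr when the gradient keeps its sign and scale, as it
        # does in your least-squares problem.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta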
Upvotes: 2
Reputation: 843
I would not say that Adam is having a hard time converging or failing; it's just taking its time:
iteration 14499
W [[ 1.9996556]]
B [ 0.02274081]
The abstract of the paper you linked says what kinds of problems Adam is best suited for, and this is not one of them. Try SGD on one of those problems and you will see a real failure.
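If you want it to close the gap faster on this particular problem, the obvious knobs are more iterations or a larger step size. Here is a minimal sketch reusing the graph (x, W, b, cost, x_l, y_l) from your question; the learning_rate value is just an illustrative assumption, not something I tuned for you:
# Drop-in replacement for the training section of the question's code.
# learning_rate=0.1 is illustrative; Adam's per-step update is roughly
# bounded by this value, so a larger rate moves W toward 2 much faster.
train = tf.train.AdamOptimizer(learning_rate=0.1).minimize(cost)
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(15000):
        sess.run(train, feed_dict={x: x_l, y_res: y_l})
    print("W", sess.run(W))
    print("B", sess.run(b))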
Upvotes: 1