Reputation: 555
There are several experiments that rely on gradient ascent rather than gradient descent. I have looked into some approaches that use a "cost" and the minimize function to simulate a "maximize" function, but I am still not certain I know how to properly implement maximize(). Also, in most of these cases, I would say they are closer to unsupervised learning. So given this code concept for a cost function:
cost = tf.square(Yexpected - Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).minimize(cost)
I would like to write something where I am following the positive gradient, and there may not be a Yexpected value:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).maximize(maxMe)
A good example of this need is Recurrent Reinforcement Learning, as in http://cs229.stanford.edu/proj2009/LvDuZhai.pdf.
I have read a few papers and references stating that changing the sign will flip the direction of movement toward the increasing gradient, but given TensorFlow's internal calculation of the gradient, I am not sure whether this will actually maximize, as I don't know of a way to validate the results:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).minimize(-1 * maxMe)
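The closest I have come to validating it is a toy check like the following (the objective maxMe = -(w - 2)^2 and the variable w are placeholders I made up just to have something testable): inspect the gradient TensorFlow would descend on and confirm that stepping against it increases maxMe.

import tensorflow as tf

w = tf.Variable(0.0)
maxMe = -(w - 2.0) ** 2    # made-up objective with a known maximum at w = 2
neg_grad = tf.gradients(-1 * maxMe, [w])[0]   # gradient the optimizer descends on

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # at w = 0, d(maxMe)/dw = -2 * (0 - 2) = +4, so the descended gradient is -4;
    # stepping against it moves w upward, toward the maximum at w = 2
    print(sess.run(neg_grad))   # -4.0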
Upvotes: 7
Views: 9372
Reputation: 14072
The intuition is simple: the minimize() function keeps pushing the given value down. For example, if you start with 5, then at every iteration (depending on the learning rate) the value will become, say, 4, then 3, then 2, 1, 0, and so on, if it is possible to bring it down further. Now if you pass -5 at the beginning (which is in fact +5, but you changed the sign explicitly), the gradient will still try to change the parameters to bring the number down further, for example to -6, -7, -8, etc. But in fact the function is increasing, because we changed the sign and the actual sign is (+). In other words, in the latter case the gradient changes the parameters of the neural network in a way that maximizes the function, not minimizes it.
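As a quick sanity check of that intuition, here is a tiny hand-rolled gradient-descent loop (plain Python; the function f, the learning rate, and the step count are arbitrary choices for illustration). Descending on -f(w) climbs f(w):

def f(w):
    return -(w - 4.0) ** 2    # concave, with its maximum f = 0 at w = 4

def grad_of_neg_f(w):
    return 2.0 * (w - 4.0)    # derivative of -f(w) = (w - 4)^2

w, lr = 0.0, 0.1
for _ in range(100):
    w = w - lr * grad_of_neg_f(w)   # ordinary gradient descent on -f
print(w, f(w))   # w -> 4.0 and f(w) -> 0.0, i.e. f was maximized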
Take a toy example: the input x = 1.5, the weight parameter at time t is w_t = 0.1,
the observed response y = 3.0, and the learning rate lr = 0.1.
x * w = 0.15 (this is the predicted y for the current w)
loss function = (3.0 - 0.15)^2 = 8.1225
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * 2(0.15 - 3.0)]) = 0.1 - (-0.855) = 0.955
If we use the new w_(t+1), we will have:
1.5 * 0.955 = 1.4325 (which is closer to the correct answer 3.0),
and the new loss is: (3.0 - 1.4325)^2 = 2.457 (a smaller error).
If we keep iterating, we will adjust w to a value that gives us the minimum cost possible.
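These numbers are easy to verify with a few lines of plain Python that just redo the arithmetic above:

x, w, y, lr = 1.5, 0.1, 3.0, 0.1

pred = x * w                    # 0.15
loss = (y - pred) ** 2          # 8.1225
grad = 2 * (pred - y) * x       # dLoss/dw = -8.55
w_new = w - lr * grad           # 0.955

new_pred = x * w_new            # 1.4325
new_loss = (y - new_pred) ** 2  # ~2.457, smaller than 8.1225
print(w_new, new_pred, new_loss)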
Now let's repeat the same experiment, but with the sign flipped to negative:
loss function = -(3.0 - 0.15)^2 = -8.1225
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * -2(0.15 - 3.0)]) = 0.1 - 0.855 = -0.755
If we apply the new w_(t+1), we will have:
1.5 * -0.755 = -1.1325, and the new loss is: (3.0 - (-1.1325))^2 = 17.08
(the original squared error is increasing, i.e. the function is being maximized!).
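The flipped-sign step can be checked the same way, with the same toy numbers:

x, w, y, lr = 1.5, 0.1, 3.0, 0.1

pred = x * w                        # 0.15
neg_loss = -(y - pred) ** 2         # -8.1225
grad = -2 * (pred - y) * x          # d(-Loss)/dw = +8.55
w_new = w - lr * grad               # -0.755

new_pred = x * w_new                # -1.1325
orig_error = (y - new_pred) ** 2    # ~17.08: the squared error grew
print(w_new, new_pred, orig_error)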
The same applies to any differentiable function; this is just a simple, naive example to demonstrate the idea.
So, you can do, as you suggested already:
optimizer.minimize(-1 * value)
Or, if you like, create a wrapper function (which is in fact needless, but just to mention it):
def maximize(optimizer, value, **kwargs):
    return optimizer.minimize(-value, **kwargs)
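And to validate the end result in TensorFlow itself, a toy run like the following works (the quadratic objective and the variable w are invented purely for this check; the API is the same TF 1.x tf.train.AdamOptimizer used in the question):

import tensorflow as tf

def maximize(optimizer, value, **kwargs):
    return optimizer.minimize(-value, **kwargs)

w = tf.Variable(0.0)
objective = -(w - 2.0) ** 2    # known maximum of 0 at w = 2
train_step = maximize(tf.train.AdamOptimizer(0.5), objective)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(300):
        sess.run(train_step)
    print(sess.run([w, objective]))   # w ends up near 2.0, objective near 0.0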
Upvotes: 5