Reputation: 17478
In tf.gradients, there is a keyword argument grad_ys:

grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of '1's of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
Why is grad_ys needed here? The docs are rather implicit about it. Could you please give a specific purpose and some code?
And my example code for tf.gradients is:

import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()

X = tf.placeholder("float", shape=[2, 1])
Y = tf.placeholder("float", shape=[2, 1])
W = tf.Variable(np.random.randn(), name='weight')
b = tf.Variable(np.random.randn(), name='bias')

pred = tf.add(tf.multiply(X, W), b)
cost = 0.5 * tf.reduce_sum(tf.pow(pred - Y, 2))
grads = tf.gradients(cost, [W, b])

sess.run(tf.global_variables_initializer())
W_, b_, pred_, cost_, grads_ = sess.run(
    [W, b, pred, cost, grads],
    feed_dict={X: [[2.0], [3.0]], Y: [[3.0], [2.0]]})
Upvotes: 3
Views: 2003
Reputation: 3643
grad_ys
is only needed for advanced use cases. Here is how you can think about it.
tf.gradients
allows you to compute tf.gradients(y, x, grad_ys) = grad_ys * dy/dx
. In other words, grad_ys
is the multiplier of each y
. In this notation, it may seem silly to provide this argument, because one should be able to just do the multiplication oneself, i.e. tf.gradients(y, x, grad_ys) = grad_ys * tf.gradients(y, x). Unfortunately, this equality does not hold, because when computing gradients backwards we perform a reduction (typically a summation) after each step to get the "intermediate loss".
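To see why the equality fails, here is a minimal NumPy sketch (the linear outputs y_i = a_i * x and the weight values are made-up toy numbers, not from the question). With a scalar x fanning out to several ys, the reduction over outputs happens inside the backward pass, so the grad_ys weights must be applied before the summation:

```python
import numpy as np

# Toy fan-out: a scalar x feeds two outputs y_i = a_i * x, so dy_i/dx = a_i.
a = np.array([2.0, 3.0])

# Default grad_ys of ones: tf.gradients(y, x) sums the per-output
# derivatives into a single tensor shaped like x.
default_grad = np.sum(1.0 * a)   # 2 + 3 = 5

# Custom grad_ys weights each y *before* that reduction.
g = np.array([10.0, 1.0])
weighted_grad = np.sum(g * a)    # 10*2 + 1*3 = 23

# The summation has already collapsed the per-y information, so no
# post-hoc multiplication of default_grad can recover weighted_grad.
print(default_grad, weighted_grad)
```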
This functionality can be useful in many cases. One is mentioned in the doc string. Here is another. Remember the chain rule: dz/dx = dz/dy * dy/dx. Let's say that we want to compute dz/dx, but dz/dy is not differentiable and we can only approximate it. Say we compute the approximation somehow and call it approx. Then dz/dx = tf.gradients(y, x, grad_ys=approx).
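As a worked illustration of that pattern (the choice y = x**2 and the constant approximation 0.5 are invented toys, not from the answer), seeding the backward pass with approx instead of ones plays out like this:

```python
import numpy as np

# Toy setup: y = x**2 elementwise, so dy/dx = 2x.
x = np.array([1.0, 2.0, 3.0])
dy_dx = 2.0 * x

# Pretend dz/dy cannot be computed exactly and we only have an
# approximation -- here a hypothetical constant 0.5 per element.
approx_dz_dy = np.full_like(x, 0.5)

# tf.gradients(y, x, grad_ys=approx) amounts to seeding the backward
# pass with approx rather than ones: dz/dx = approx * dy/dx.
dz_dx = approx_dz_dy * dy_dx
print(dz_dx)   # [1. 2. 3.]
```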
Another use case is when you have a model with a huge fan-in. Say you have 100 input sources that each go through a few layers (call these the "100 branches"), get combined at y, and go through 10 more layers until you reach a loss. Computing all the gradients for the whole model at once (which requires remembering many activations) might not fit in memory. One way around this is to compute d(loss)/dy first, and then compute the gradients of loss with respect to the variables in branch_i using tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy). Using this (and a few more details I am skipping) you can reduce the peak memory requirement.
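The two-stage trick can be sketched with plain NumPy on a toy model (the two scalar "branches" and the quadratic loss below are invented for illustration, not part of the answer):

```python
import numpy as np

# Toy fan-in: two branches contribute y = w1*x1 + w2*x2,
# then loss = 0.5 * y**2, so d(loss)/dy = y.
x1, x2 = 1.0, 2.0
w1, w2 = 3.0, 4.0
y = w1 * x1 + w2 * x2          # 3 + 8 = 11

# Stage 1: compute and store only the gradient at the junction.
dloss_dy = y

# Stage 2: per-branch gradients seeded with the stored value,
# mirroring tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy).
# Each branch can be processed separately, keeping peak memory low.
dloss_dw1 = dloss_dy * x1
dloss_dw2 = dloss_dy * x2
print(dloss_dw1, dloss_dw2)    # 11.0 22.0
```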
Upvotes: 4