GoingMyWay

Reputation: 17478

tf.gradients, how can I understand `grad_ys` and use it?

In tf.gradients, there is a keyword argument grad_ys

grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of ‘1’s of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).

Why is grad_ys needed here? The documentation is not explicit about it. Could you give a specific purpose and some example code?

And my example code for tf.gradients is

In [1]: import numpy as np

In [2]: import tensorflow as tf

In [3]: sess = tf.InteractiveSession()

In [4]: X = tf.placeholder("float", shape=[2, 1])

In [5]: Y = tf.placeholder("float", shape=[2, 1])

In [6]: W = tf.Variable(np.random.randn(), name='weight')

In [7]: b = tf.Variable(np.random.randn(), name='bias')

In [8]: pred = tf.add(tf.multiply(X, W), b)

In [9]: cost = 0.5 * tf.reduce_sum(tf.pow(pred-Y, 2))

In [10]: grads = tf.gradients(cost, [W, b])

In [11]: sess.run(tf.global_variables_initializer())

In [15]: W_, b_, pred_, cost_, grads_ = sess.run([W, b, pred, cost, grads], 
                                    feed_dict={X: [[2.0], [3.]], Y: [[3.0], [2.]]})
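
For reference, syntactically the argument could be passed like this (a sketch continuing the session above; the weight value 2.0 is arbitrary and only there to show where grad_ys goes):

    # Sketch: passing grad_ys explicitly. With grad_ys=None (the default),
    # tf.gradients fills in a tensor of ones shaped like cost (here a scalar 1.0),
    # so grads_scaled below is simply 2x the default gradients.
    grads_default = tf.gradients(cost, [W, b])
    grads_scaled = tf.gradients(cost, [W, b], grad_ys=tf.constant(2.0))

    g1, g2 = sess.run([grads_default, grads_scaled],
                      feed_dict={X: [[2.0], [3.0]], Y: [[3.0], [2.0]]})
    # g2[i] == 2 * g1[i] for each i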

Upvotes: 3

Views: 2003

Answers (1)

iga

Reputation: 3643

grad_ys is only needed for advanced use cases. Here is how you can think about it.

tf.gradients allows you to compute tf.gradients(y, x, grad_ys) = grad_ys * dy/dx. In other words, grad_ys is a multiplier for each y. In this notation it may seem silly to provide this argument, because one should be able to do the multiplication oneself, i.e. tf.gradients(y, x, grad_ys) = grad_ys * tf.gradients(y, x). Unfortunately, this equality does not hold: when computing gradients backwards, a reduction (typically a summation) is performed after each step to get the "intermediate loss".
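A minimal sketch (TF 1.x, made-up tensors) of that difference: when y is not a scalar, tf.gradients sums over its elements, so a per-element grad_ys cannot be reproduced by multiplying the summed result afterwards:

    import tensorflow as tf

    x = tf.constant(2.0)
    y = tf.stack([x, x * x, x * x * x])   # y = [x, x^2, x^3]

    # Summed gradient: 1 + 2x + 3x^2 = 17 at x = 2
    g_plain = tf.gradients(y, x)[0]

    # Weights are applied *before* that summation: 0*1 + 1*2x + 0*3x^2 = 4 at x = 2.
    # No single multiplier turns 17 into 4, so this cannot be done after the fact.
    g_weighted = tf.gradients(y, x, grad_ys=tf.constant([0.0, 1.0, 0.0]))[0]

    with tf.Session() as sess:
        print(sess.run(g_plain))     # 17.0
        print(sess.run(g_weighted))  # 4.0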

This functionality can be useful in many cases. One is mentioned in the doc string. Here is another. Remember the chain rule: dz/dx = dz/dy * dy/dx. Say we want to compute dz/dx, but z is not differentiable with respect to y and we can only approximate dz/dy. Say we compute that approximation somehow and call it approx. Then dz/dx = tf.gradients(y, x, grad_ys=approx).
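A small sketch of that idea (TF 1.x; approx_dz_dy is a made-up stand-in for whatever approximation of dz/dy you actually have):

    import tensorflow as tf

    x = tf.constant([1.0, 2.0])
    y = 3.0 * x                            # dy/dx = 3, elementwise

    # Pretend z depends on y through something non-differentiable, and this is
    # our hand-made approximation of dz/dy (made-up numbers):
    approx_dz_dy = tf.constant([0.5, -1.0])

    # Chain rule dz/dx = dz/dy * dy/dx, with the approximated dz/dy fed in as grad_ys
    dz_dx = tf.gradients(y, x, grad_ys=approx_dz_dy)[0]

    with tf.Session() as sess:
        print(sess.run(dz_dx))             # [ 1.5 -3. ]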

Another use case is a model with a "huge fan-in". Say you have 100 input sources that go through a few layers (call these the "100 branches"), get combined at y, and then go through 10 more layers until you reach a loss. Computing all the gradients for the whole model at once (which requires remembering many activations) might not fit in memory. One way around this is to compute d(loss)/dy first, and then compute the gradients of the loss with respect to the variables in branch_i using tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy). Using this (and a few more details I am skipping) you can reduce the peak memory requirement.
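Here is a rough sketch of that two-stage idea (TF 1.x; the model, branch variables, and shapes are all hypothetical, and the extra bookkeeping needed to actually run the stages separately and save memory is omitted):

    import tensorflow as tf

    x1 = tf.random_normal([8, 16])                  # "branch 1" input
    x2 = tf.random_normal([8, 16])                  # "branch 2" input
    w1 = tf.Variable(tf.random_normal([16, 32]))    # branch 1 variables
    w2 = tf.Variable(tf.random_normal([16, 32]))    # branch 2 variables
    w_top = tf.Variable(tf.random_normal([32, 1]))  # layers above the merge point

    y = tf.matmul(x1, w1) + tf.matmul(x2, w2)       # branches combined at y
    loss = tf.reduce_sum(tf.matmul(tf.nn.relu(y), w_top))

    # Stage 1: gradient of the top part of the model only
    dloss_dy = tf.gradients(loss, y)[0]

    # Stage 2: gradients of each branch, seeded with d(loss)/dy via grad_ys
    grads_branch1 = tf.gradients(y, [w1], grad_ys=dloss_dy)
    grads_branch2 = tf.gradients(y, [w2], grad_ys=dloss_dy)
    # These match tf.gradients(loss, [w1, w2]) computed in one shot.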

Upvotes: 4
