Ngoc N. Tran

Reputation: 1068

GradientTape with Keras returns 0

I've tried using GradientTape with a Keras model (simplified) as follows:

import tensorflow as tf
tf.enable_eager_execution()

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))

But for some random values of inp, the gradient is zero everywhere, and for the rest, the gradient magnitude is really small (<1e-7).

I've also tried this with an MNIST-trained 3-layer MLP and the results are the same, but trying it with a one-layer linear model with no activation works.

What's going on here?

Upvotes: 3

Views: 2241

Answers (2)

xdurch0

Reputation: 10474

You are computing gradients of a softmax output layer. Since the softmax always sums to 1, it makes sense that the gradients (which, in a multi-output case, are summed/averaged over dimensions, AFAIK) must be 0: the overall output of the layer cannot change. The cases where you get small values > 0 are numerical hiccups, I presume.
When you remove the activation function, this constraint no longer holds, so the summed outputs can change and the gradients can have magnitude > 0.
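For illustration, here is a minimal sketch of that point: summing the softmax outputs inside the tape gives a constant (always 1), so the gradient of that sum is zero up to floating-point noise. This is a standalone toy example, not the asker's model:

import tensorflow as tf
import numpy as np
tf.enable_eager_execution()  # not needed in TF 2.x

# A single softmax layer is enough to see the effect.
dense = tf.keras.layers.Dense(10, activation='softmax')

x = tf.Variable(np.random.random((1, 28 * 28)), dtype=tf.float32)
with tf.GradientTape() as g:
    g.watch(x)
    # The softmax outputs always sum to 1, so this is constant in x...
    total = tf.reduce_sum(dense(x))

# ...and the gradient of a constant is 0 (up to numerical noise).
print(tf.reduce_max(tf.abs(g.gradient(total, x))))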

Are you trying to use gradient descent to construct inputs that result in a very large probability for a certain class (if not, disregard this...)? @jdehesa already included a way to do this via the loss function. Note that you can do it via the softmax as well, like so:

import tensorflow as tf
tf.enable_eager_execution()

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')   
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)[:,0]

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))

Note that I grab only the results in column 0, corresponding to the first class (I removed target because it's not used). This will compute gradients only for the softmax value of this class, which are meaningful.
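Equivalently, if you would rather keep the one-hot target around, you can pick out the class probability by multiplying with it and summing. A small sketch, reusing model and inp from the snippet above and the asker's original target:

target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)
with tf.GradientTape() as g:
    g.watch(inp)
    probs = model(inp, training=False)
    # Selects the probability of the target class; keep this inside the tape.
    result = tf.reduce_sum(probs * target, axis=-1)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))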

Some caveats:

  • It's important to do the indexing inside the gradient tape context manager! If you do it outside (e.g. in the line where you call g.gradient), it will not work (no gradients).
  • You can also use gradients of the logits (pre-softmax values) instead, as in the sketch below. This is different, because softmax probabilities can be increased by making other classes less likely, whereas logits can only be increased by increasing the "score" for the class in question.
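A minimal sketch of the logit variant: build the dense layer without the softmax and take the gradient of the raw score for class 0. Note this is a fresh, untrained model (logit_model is just an illustrative name), so its weights are not those of the model above:

import tensorflow as tf
import numpy as np
tf.enable_eager_execution()  # not needed in TF 2.x

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
logits = tf.keras.layers.Dense(10)(flat)  # no activation: raw logits
logit_model = tf.keras.Model(input_, logits)

inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
with tf.GradientTape() as g:
    g.watch(inp)
    # Gradient of the class-0 logit w.r.t. the input.
    result = logit_model(inp, training=False)[:, 0]

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))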

Upvotes: 4

javidcf

Reputation: 59701

Computing the gradients against the output of the model is not usually very meaningful; in general, you compute the gradients against the loss, which is what tells the model where the variables should go to reach your goal. In this case you would be optimizing your input instead of the model parameters, but the principle is the same.

import tensorflow as tf
import numpy as np
tf.enable_eager_execution()  # Not necessary in TF 2.x

tf.random.set_random_seed(0)  # tf.random.set_seed in TF 2.x
np.random.seed(0)
input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

inp = tf.Variable(np.random.random((1, 28, 28)), dtype=tf.float32, name='input')
target = tf.constant([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)
    # Get the loss for the example
    loss = tf.keras.losses.categorical_crossentropy(target, result)

print(tf.reduce_max(tf.abs(g.gradient(loss, inp))))
# tf.Tensor(0.118953675, shape=(), dtype=float32)
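If the goal is to actually push the input towards the target class, you can take repeated gradient steps on inp itself. A rough sketch continuing from the snippet above; the step size of 0.1 and the 100 iterations are arbitrary choices:

# Plain gradient descent on the input (the model weights are untouched).
step_size = 0.1
for _ in range(100):
    with tf.GradientTape() as g:
        g.watch(inp)
        result = model(inp, training=False)
        loss = tf.keras.losses.categorical_crossentropy(target, result)
    # Move the input in the direction that decreases the loss.
    inp.assign_sub(step_size * g.gradient(loss, inp))

# The probability of class 0 should now be noticeably higher.
print(model(inp, training=False)[0, 0])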

Upvotes: 2
