expoter

Reputation: 1640

Why is the gradient of tanh in TensorFlow `grad = dy * (1 - y*y)`?

The documentation of tf.raw_ops.TanhGrad says that grad = dy * (1 - y*y), where y = tanh(x).

But since dy / dx = 1 - y*y, where y = tanh(x), I think grad should be dy / (1 - y*y). Where am I wrong?
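
For reference, a minimal snippet showing what the op computes (the input values and the all-ones upstream gradient are arbitrary, chosen just for this example):

```python
import tensorflow as tf

x = tf.constant([0.0, 0.5, 1.0])
y = tf.tanh(x)
dy = tf.ones_like(y)  # upstream gradient of ones, for illustration

# TanhGrad takes y = tanh(x) and the upstream gradient dy,
# and returns dy * (1 - y * y).
grad = tf.raw_ops.TanhGrad(y=y, dy=dy)
print(grad)  # same values as dy * (1 - y * y)
```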

Upvotes: 2

Views: 615

Answers (1)

javidcf

Reputation: 59731

An expression like dy / dx is mathematical notation for the derivative; it is not an actual fraction. It is meaningless to move dy or dx around individually as you would with a numerator and a denominator.

Mathematically, it is known that d(tanh(x))/dx = 1 - (tanh(x))^2. TensorFlow computes gradients "backwards" (what is called backpropagation, or more generally reverse-mode automatic differentiation). That means that, in general, we reach the computation of the gradient of tanh(x) only after the step where we compute the gradient of an "outer" function g(tanh(x)). Here g represents all the operations applied to the output of tanh to reach the value for which the gradient is computed.

By the chain rule, the derivative of this composition is d(g(tanh(x)))/dx = d(g(tanh(x)))/d(tanh(x)) * d(tanh(x))/dx. The first factor, d(g(tanh(x)))/d(tanh(x)), is the gradient accumulated in reverse up to tanh, that is, the derivative of all those later operations, and it is the value called dy in the documentation of the function. Therefore, you only need to compute d(tanh(x))/dx (which is (1 - y * y), because y = tanh(x)) and multiply it by the given dy. The resulting value is then propagated further back to the operation that produced the input x of tanh in the first place; it becomes the dy value in the computation of that gradient, and so on until the gradient sources are reached. A sketch of this correspondence follows below.
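
As an illustration (a sketch with an arbitrary outer function g(y) = sum(y^2); the input values and variable names are mine, not from the docs), you can check that TanhGrad, fed the manually computed upstream gradient dy, reproduces exactly what GradientTape computes end to end:

```python
import tensorflow as tf

x = tf.constant([0.25, 1.0, 2.0])

with tf.GradientTape() as tape:
    tape.watch(x)                    # x is a constant, so watch it explicitly
    y = tf.tanh(x)
    z = tf.reduce_sum(tf.square(y))  # g(tanh(x)): an arbitrary "outer" function

# Gradient of g with respect to tanh's output: this is the dy that
# backpropagation hands to TanhGrad.
dy = 2.0 * y  # d(sum(y^2))/dy = 2y

# TanhGrad multiplies the upstream dy by the local derivative (1 - y*y).
manual = tf.raw_ops.TanhGrad(y=y, dy=dy)

# GradientTape computes the same quantity end to end.
auto = tape.gradient(z, x)

print(manual.numpy())
print(auto.numpy())  # identical values: 2y * (1 - y*y)
```

Note that TanhGrad never sees x itself: y and the incoming dy are all it needs, which is exactly why the op is defined in terms of y = tanh(x).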

Upvotes: 1
