Reputation: 1816

How does TensorFlow calculate gradients efficiently from input to loss?

To calculate the derivative of an output layer of size N w.r.t. an input of size M, we need a Jacobian matrix of size N x M. To calculate the complete gradient from the loss to the inputs using the chain rule, we would need to store a large number of such Jacobians in memory.

I assume that TensorFlow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?

Thanks

Upvotes: 2

Answers (1)

rvinas

Reputation: 11895

TensorFlow uses automatic differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then the sum of the path values over all paths from x to f, where each path value is the product of the partial derivatives along the edges of that path.
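To make the sum-over-paths rule concrete, here is a small hand-worked sketch in plain Python (no TensorFlow). The graph, f = a * b with a = x^2 and b = sin(x), and the helper name `df_dx_by_paths` are made up for illustration. There are two paths from x to f, and their path values add up to the derivative, which is checked against a finite difference.

```python
import math

def f(x):
    a = x * x          # node a = x^2
    b = math.sin(x)    # node b = sin(x)
    return a * b       # node f = a * b

def df_dx_by_paths(x):
    # Hypothetical helper: evaluates df/dx as a sum over paths in the graph.
    a = x * x
    b = math.sin(x)
    # Edge values: partial derivative of each child w.r.t. its parent.
    da_dx = 2 * x
    db_dx = math.cos(x)
    df_da = b
    df_db = a
    # Two paths from x to f: x -> a -> f and x -> b -> f.
    path_via_a = da_dx * df_da
    path_via_b = db_dx * df_db
    return path_via_a + path_via_b

x0 = 1.5
eps = 1e-6
# Central finite difference as an independent check.
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
path_sum = df_dx_by_paths(x0)
print(path_sum, numeric)
```

Enumerating paths explicitly like this does not scale, since the number of paths can grow exponentially with depth. It is only meant to show what quantity is being computed.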

More specifically, TensorFlow uses reverse-mode differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. The backward pass has to be repeated for each output dimension of f, so the computational complexity is O(dim(f)) * O(f), where dim(f) is the output dimensionality of f and O(f) is the cost of evaluating f.
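The forward/backward structure can be sketched with a toy reverse-mode implementation in plain Python. This is not TensorFlow's actual code. The `Node` class and the `mul`, `add`, `sin` and `backward` helpers are invented for this example. The point is that a single backward pass over the recorded graph yields the derivative of the scalar f with respect to every node at once.

```python
import math

class Node:
    """A value in the graph plus links to its parents with local partials."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, d(self)/d(parent))
        self.grad = 0.0         # will hold d(output)/d(self)

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def sin(a):
    return Node(math.sin(a.value), ((a, math.cos(a.value)),))

def backward(output):
    # Topologically order all nodes reachable from the output.
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)
    # Walk the graph from the output back to the inputs, accumulating
    # each node's gradient into its parents via the chain rule.
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x = Node(2.0)
y = Node(3.0)
f = add(mul(x, y), sin(x))   # forward pass: f = x*y + sin(x)
backward(f)                  # one backward pass for the scalar f
print(x.grad, y.grad)        # df/dx = y + cos(x), df/dy = x
```

If f had several outputs, `backward` would have to be run once per output, which is where the O(dim(f)) factor comes from.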

Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
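Regarding the Jacobians from the question: in reverse mode, each operation only has to multiply the incoming gradient vector by its Jacobian (a vector-Jacobian product), so the full matrix never needs to be built. Below is a minimal sketch for an elementwise tanh, in plain Python with made-up function names. The full N x N Jacobian is constructed only to verify the result. Note that the backward step reuses the output `y` stored during the forward pass, which is the memory cost mentioned above.

```python
import math

def tanh_forward(x):
    return [math.tanh(xi) for xi in x]

def tanh_backward(y, upstream):
    # Vector-Jacobian product: the Jacobian of elementwise tanh is diagonal,
    # so it reduces to an elementwise product, O(N) work and memory.
    return [v * (1.0 - yi * yi) for v, yi in zip(upstream, y)]

def tanh_full_jacobian(y):
    # Explicit N x N Jacobian, built here only to check the result above.
    n = len(y)
    return [[(1.0 - y[i] * y[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]

x = [0.1, -0.4, 0.7, 1.2]
y = tanh_forward(x)                 # stored during the forward pass
upstream = [1.0, 2.0, -1.0, 0.5]    # dL/dy arriving from later layers

vjp = tanh_backward(y, upstream)    # dL/dx without forming the Jacobian
J = tanh_full_jacobian(y)
explicit = [sum(upstream[i] * J[i][j] for i in range(len(y)))
            for j in range(len(x))]
print(vjp)
print(explicit)
```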

You might find this resource useful.

Upvotes: 3
