Reputation: 1816
To calculate the derivative of an output layer of size N
w.r.t an input of size M
, we need a Jacobian matrix of size M x N
. To calculate a complete gradient from loss to inputs using the chain rule, we would need a large number of such Jacobians stored in memory.
I assume that tensorflow does not calculate a complete Jacobian matrix for each step of the graph, but does something more efficient. How does it do it?
Thanks
Upvotes: 2
Views: 695
Reputation: 11895
TensorFlow uses Automatic Differentiation to compute gradients efficiently. Concretely, it defines a computation graph in which nodes are operations and each directed edge represents the partial derivative of a child with respect to its parent. The total derivative of an operation f with respect to x is then given by the sum over all path values from x to f, where each path value is the product of the partial derivatives of the operations on the edges.
More specifically, TensorFlow uses reverse differentiation, which involves a forward pass to compute the value of each node in the computation graph, and a backward pass to compute the partial derivative of the function f that we are differentiating with respect to every node in the graph. We need to repeat the backward pass for each dimension of function f, so the computational complexity is O(dim(f))*O(f), where dim(f) is the output dimensionality of function f.
Although this approach is memory intensive (it requires storing the values of all the nodes before running the backward pass), it is very efficient for machine learning, where we typically have a scalar function f (i.e. dim(f)=1).
You might find this resource useful.
Upvotes: 3