david streader

Reputation: 589

Are reverse mode AD and forward mode AD the same function?

Hi, I am an old hand at programming but very new to Julia, so the answer may be obvious.

Forward Mode AD is often compared with the forward pass of a neural net, and Reverse Mode AD with back propagation, and clearly you cannot replace back propagation with a forward pass. Forward and Reverse Mode AD both compute the gradient. But are they the same function, or, ignoring efficiency, is Reverse Mode doing something Forward Mode is not? Alternatively, are there applications using Reverse Mode where, ignoring efficiency, Forward Mode could not be used?

The reason for asking is that the paper https://richarde.dev/papers/2022/ad/higher-order-ad.pdf (Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation) defines Reverse Mode AD as a correct and efficient way to compute the gradient when there is a large number of inputs. Yet their definition computes the same function as Forward Mode AD.

Any correction of my understanding much appreciated.

Clarification:

Let Forward Algorithm mean an algorithm that takes dx and computes df(x), and Reverse Algorithm mean an algorithm that takes df(x) and computes a possible dx.

The easiest Automatic Differentiation algorithm to understand is a Forward Algorithm called Forward Mode Automatic Differentiation, but it is not efficient when there is a large vector of inputs (x). Hence Reverse Algorithms were invented that are efficient with a large vector of inputs; these were called Reverse Mode Automatic Differentiation. These Reverse Algorithms appear much harder both to understand and to implement.
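To make that concrete, here is a minimal dual-number sketch of what I mean by a Forward Algorithm (my own toy code, not anything from the paper): each value carries its derivative along, so one pass propagates one input direction dx, and n inputs need n passes.

```julia
# Toy forward-mode AD with dual numbers (illustrative only).
struct Dual
    val::Float64   # primal value
    der::Float64   # derivative w.r.t. the seeded input direction
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)

f(x, y) = x * y + x

# Seed der = 1 on x to get ∂f/∂x; a second pass seeding y is needed for ∂f/∂y.
f(Dual(3.0, 1.0), Dual(2.0, 0.0))   # Dual(9.0, 3.0): value 9, ∂f/∂x = y + 1 = 3
```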

In the cited paper, a Haskell forward algorithm for Automatic Differentiation is given that is efficient with a large vector of inputs. Because of that efficiency, it is (not unreasonably) called Reverse Mode Automatic Differentiation. On the assumption both that they are correct and that their algorithm can be implemented for Julia programs, does this mean that Reverse Algorithms are no longer useful? Or are there use cases (not Automatic Differentiation) for which Reverse Algorithms are still needed?

Upvotes: 4

Views: 870

Answers (1)

phipsgabler

Reputation: 20960

It depends on what you mean by "computes the same function". Used the right way, the same numbers (up to floating-point error) will fall out at the end -- the gradient. The operations involved are different, though.

AD always evaluates a certain linear operator at a given point. In the case of forward mode, the operator is the Jacobian J, and for backward mode, it's its adjoint J'. But as you have the operator only in algorithmic form, not in matrix form, you cannot just "transpose it" to get one from the other. Instead, you can evaluate both, but in different bases, to recover the same gradient.
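Spelled out for a function f : R^n -> R^m with Jacobian J at the point x (same notation as above):

    forward mode (pushforward): v in R^n  |->  J v  in R^m   (a Jacobian-vector product)
    backward mode (pullback):   w in R^m  |->  J' w in R^n   (a vector-Jacobian product)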

In essence, it's a question of matrix multiplication: do you compute Jv for all one-hot vectors v, or dJ' for the scalar d = 1 to get the whole gradient at once? The inefficiency of forward mode comes from the fact that for R^n -> R functions, the size of this basis (the number of one-hot vectors) scales with n.
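For example, in Julia terms (a rough sketch assuming ForwardDiff.jl for the forward side and Zygote.jl for the reverse side; the helper functions are mine):

```julia
using ForwardDiff, Zygote

f(x) = sum(abs2, x)              # R^n -> R; the gradient is 2x
x = [1.0, 2.0, 3.0]
n = length(x)

onehot(i, n) = [j == i ? 1.0 : 0.0 for j in 1:n]

# Forward mode: one directional derivative Jv per one-hot vector v, so n passes.
jvp(f, x, v) = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)
grad_fwd = [jvp(f, x, onehot(i, n)) for i in 1:n]

# Backward mode: a single pullback with the scalar seed d = 1 gives the whole gradient.
grad_rev, = Zygote.gradient(f, x)

grad_fwd ≈ grad_rev              # true: same numbers, different operations
```

Both end up at [2.0, 4.0, 6.0]; forward mode just needed n pushforward evaluations to get there.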

So: they are not the same function, but adjoint, and you can recover either result from the other.

Sorry if that's a bit dense, but anything more would become a lecture introducing AD. SO is probably not the right place for that. You should take the time to find some introductory material, though.

Upvotes: 3
