Reputation: 366
In http://caffe.berkeleyvision.org/tutorial/solver.html it says:
Note that the momentum setting μ effectively multiplies the size of your updates by a factor of 1/(1-μ) after many iterations of training, so if you increase μ, it may be a good idea to decrease α accordingly (and vice versa).
My question is:
Why 1/(1-μ), and how can that be proved?
Why is it a good idea to decrease α when μ is increased?
Upvotes: 3
Views: 781
Reputation: 237
Simply put, it's the sum of a Geometric Progression.
Update with momentum means that the "velocity" and "position" are updated as follows:
v = μ * v + α * gradient
θ = θ - v
Now, assuming that initially v = 0 and the gradient remains (nearly) constant (say 1 for convenience), the velocity evolves as:

v_1 = α
v_2 = μα + α = α(1 + μ)
v_3 = μ²α + μα + α = α(1 + μ + μ²)
...
v_∞ = α(1 + μ + μ² + ...) = α / (1 - μ)

(using the formula for the sum of an infinite geometric progression, which converges since 0 ≤ μ < 1).
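You can check this numerically. The sketch below (my own illustration, with μ, α, and the constant gradient chosen arbitrarily) iterates the momentum update and shows the velocity settling at α / (1 - μ):

```python
# Iterate the momentum update with a constant gradient of 1 and
# watch the velocity approach the geometric-series limit alpha / (1 - mu).
mu = 0.9        # momentum (illustrative value)
alpha = 0.01    # learning rate (illustrative value)
gradient = 1.0  # assumed constant, as in the derivation above

v = 0.0
for _ in range(200):
    v = mu * v + alpha * gradient

limit = alpha / (1 - mu)  # sum of the infinite geometric progression
print(v, limit)           # v is essentially equal to alpha / (1 - mu) = 0.1
```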
EDIT: To answer the second part of your question, (adding to @Prune's answer below) the 1/(1 - μ) * α behaves more or less like an "effective learning rate". So if some particular value of α was working well before you changed μ, you should compensate by decreasing α to keep the "effective learning rate" constant. This is as important as selecting the correct learning rate in gradient descent without momentum.
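One way to picture this compensation (a sketch of my own, not from any framework's API) is a small helper that rescales α whenever μ changes so that α / (1 - μ) stays fixed:

```python
# Keep the "effective learning rate" alpha / (1 - mu) constant when mu changes.
def compensated_alpha(alpha_old, mu_old, mu_new):
    """Return the new alpha that keeps alpha / (1 - mu) unchanged."""
    return alpha_old * (1 - mu_new) / (1 - mu_old)

# Raising mu from 0.9 to 0.99 multiplies 1/(1 - mu) by 10,
# so alpha should shrink by the same factor of 10.
alpha_new = compensated_alpha(0.01, 0.9, 0.99)
print(alpha_new)  # ~0.001
```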
Upvotes: 3
Reputation: 77857
Speaking to your second point, you generally want the velocity tuned to a value compatible with your problem. The velocity describes the movement of your estimated solution point. If the velocity is too small, you converge too slowly, and/or overfit; if it's too large, you can thrash around the solution point, and even fail to converge.
Most algorithms will have controls for this second problem, often simply reducing α by a small factor (such as .01) whenever we find a new best-ever loss. The part you need to control is your initial setting. If you increase μ such that 1/(1-μ) goes up by a factor of 1.25, try reducing α by 20% to compensate.
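The arithmetic behind that last suggestion can be verified directly. In this sketch the concrete μ values are my own choice, picked so that 1/(1-μ) grows by exactly 1.25:

```python
# If 1/(1 - mu) grows by a factor of 1.25, scaling alpha by 0.8
# (a 20% reduction) restores the original effective learning rate.
mu_old, mu_new = 0.9, 0.92            # 1/(1-mu) goes from 10 to 12.5
factor = (1 - mu_old) / (1 - mu_new)  # growth of 1/(1 - mu)
print(factor)                          # ~1.25

alpha_old = 0.01
alpha_new = alpha_old * 0.8            # reduce alpha by 20%
eff_old = alpha_old / (1 - mu_old)
eff_new = alpha_new / (1 - mu_new)
print(eff_old, eff_new)                # the two effective rates match
```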
Upvotes: 2