Long

Reputation: 366

Caffe solver documentation: how to understand that the momentum μ has the effect of a factor $1/(1-μ)$?

The Caffe solver tutorial at http://caffe.berkeleyvision.org/tutorial/solver.html says:

Note that the momentum setting μ effectively multiplies the size of your updates by a factor of 1/(1-μ) after many iterations of training, so if you increase μ, it may be a good idea to decrease α accordingly (and vice versa).

My questions are:

  1. Why is the factor 1/(1-μ), and how can that be proven?

  2. Why is it a good idea to decrease α when increasing μ?

Upvotes: 3

Views: 781

Answers (2)

gunner

Reputation: 237

Simply put, it's the sum of a Geometric Progression.

Update with momentum means that the "velocity" and "position" are updated as follows:

v = μ * v + α * gradient

θ = θ - v

Now, assuming that initially v = 0 and the gradient remains (nearly) constant (say 1 for convenience), the velocity evolves as:

  • 0,
  • α,
  • (1 + μ) * α,
  • (1 + μ(1 + μ)) * α = (1 + μ + μ^2) * α,
  • (1 + μ + μ^2 + μ^3) * α,
  • (1 + μ + μ^2 + μ^3 + μ^4) * α,
  • (1 + μ + μ^2 + μ^3 + μ^4 + μ^5) * α,
  • ...
  • 1/(1 - μ) * α

(Using the formula for the sum of an infinite geometric progression, 1 + μ + μ^2 + ... = 1/(1 - μ), which holds for |μ| < 1.)
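As a rough numerical check of the progression above, the following Python sketch iterates the velocity update with a constant gradient of 1. The function name `velocity_after` and the values μ = 0.9, α = 0.01 are illustrative assumptions, not anything taken from Caffe itself.

```python
def velocity_after(mu, alpha, steps, gradient=1.0):
    """Iterate v = mu * v + alpha * gradient, starting from v = 0."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + alpha * gradient
    return v


mu, alpha = 0.9, 0.01          # illustrative values only
limit = alpha / (1.0 - mu)     # predicted steady-state velocity

for steps in (1, 10, 50, 200):
    v = velocity_after(mu, alpha, steps)
    print(f"steps={steps:4d}  v={v:.6f}  limit={limit:.6f}")
```

With a constant gradient the velocity climbs toward α/(1 - μ) and stays there; with real, changing gradients this holds only approximately.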

EDIT: To answer the second part of your question (adding to @Prune's answer below): the quantity 1/(1 - μ) * α behaves more or less like an "effective learning rate". So if some particular value of α was working well before you changed μ, you should compensate by decreasing α to keep the "effective learning rate" constant. This is as important as selecting the correct learning rate in gradient descent without momentum.
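A minimal sketch of that compensation, assuming the goal is to keep α/(1 - μ) exactly constant when μ changes. The helper name `compensated_alpha` is hypothetical and is not part of Caffe; treat the result as a starting point for tuning rather than a guaranteed-good value.

```python
def compensated_alpha(alpha_old, mu_old, mu_new):
    """Return the alpha that keeps alpha / (1 - mu) unchanged.

    Solves alpha_new / (1 - mu_new) == alpha_old / (1 - mu_old).
    """
    if not (0.0 <= mu_old < 1.0 and 0.0 <= mu_new < 1.0):
        raise ValueError("momentum values must lie in [0, 1)")
    return alpha_old * (1.0 - mu_new) / (1.0 - mu_old)


# Illustrative values: raising mu from 0.9 to 0.99 multiplies 1/(1 - mu)
# by about 10, so alpha shrinks by about the same factor.
alpha_new = compensated_alpha(alpha_old=0.01, mu_old=0.9, mu_new=0.99)
print(alpha_new)
```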

Upvotes: 3

Prune

Reputation: 77857

Speaking to your second point, you generally want the velocity tuned to a value compatible with your problem. The velocity describes the movement of your estimated solution point. If the velocity is too small, you converge too slowly, and/or overfit; if it's too large, you can thrash around the solution point, and even fail to converge.

Most algorithms have controls for this second problem (a velocity that is too large), often simply reducing α by a small factor (such as 0.01) whenever a new best-ever loss is found. The part you need to control is your initial setting. If you increase μ such that 1/(1-μ) goes up by a factor of 1.25, try reducing α by 20% (that is, multiplying it by 1/1.25 = 0.8) to compensate.
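To make the 1.25 and 20% figures concrete, here is a small arithmetic sketch. The momentum values 0.90 and 0.92 are an assumed example chosen because they produce that ratio; `update_multiplier` is a hypothetical helper name.

```python
def update_multiplier(mu):
    """Long-run factor by which momentum scales the update size."""
    return 1.0 / (1.0 - mu)


mu_old, mu_new = 0.90, 0.92    # assumed example values

# How much the long-run multiplier grows when mu is raised.
growth = update_multiplier(mu_new) / update_multiplier(mu_old)

# Scale alpha by the reciprocal to cancel that growth.
alpha_scale = 1.0 / growth

print(f"multiplier grows by about {growth:.2f}x")
print(f"scale alpha by about {alpha_scale:.2f} (a {100 * (1 - alpha_scale):.0f}% reduction)")
```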

Upvotes: 2
