Reputation: 366
In http://caffe.berkeleyvision.org/tutorial/solver.html it says:
Note that the momentum setting μ effectively multiplies the size of your updates by a factor of 1/(1-μ) after many iterations of training, so if you increase μ, it may be a good idea to decrease α accordingly (and vice versa).
My question is:
Why 1/(1-μ), and how can that be proved?
Why is it a good idea to decrease α when μ is increased?
Upvotes: 3
Views: 781
Reputation: 237
Simply put, it's the sum of a Geometric Progression.
Update with momentum means that the "velocity" and "position" are updated as follows:
v = μ * v + α * gradient
θ = θ - v
Now, assuming that initially v = 0 and the gradient remains (nearly) constant (say 1 for convenience), the velocity evolves as:

v_1 = α
v_2 = μα + α = α(1 + μ)
v_3 = μ²α + μα + α = α(1 + μ + μ²)
...
v_∞ = α(1 + μ + μ² + ...) = α / (1 - μ)

(using the formula for the sum of an infinite geometric progression, which converges since 0 ≤ μ < 1).
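You can check this numerically. The sketch below (my own illustration, with μ, α, and the constant gradient chosen arbitrarily) iterates the momentum update and shows the velocity settling at α / (1 - μ):

```python
# Iterate the momentum update with a constant gradient of 1 and
# watch the velocity approach the geometric-series limit alpha / (1 - mu).
mu = 0.9        # momentum (illustrative value)
alpha = 0.01    # learning rate (illustrative value)
gradient = 1.0  # assumed constant, as in the derivation above

v = 0.0
for _ in range(200):
    v = mu * v + alpha * gradient

limit = alpha / (1 - mu)  # sum of the infinite geometric progression
print(v, limit)           # v is essentially equal to alpha / (1 - mu) = 0.1
```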
EDIT: To answer the second part of your question, (adding to @Prune's answer below) the 1/(1 - μ) * α behaves more or less like an "effective learning rate". So if some particular value of α was working well before you changed μ, you should compensate by decreasing α to keep the "effective learning rate" constant. This is as important as selecting the correct learning rate in gradient descent without momentum.
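One way to picture this compensation (a sketch of my own, not from any framework's API) is a small helper that rescales α whenever μ changes so that α / (1 - μ) stays fixed:

```python
# Keep the "effective learning rate" alpha / (1 - mu) constant when mu changes.
def compensated_alpha(alpha_old, mu_old, mu_new):
    """Return the new alpha that keeps alpha / (1 - mu) unchanged."""
    return alpha_old * (1 - mu_new) / (1 - mu_old)

# Raising mu from 0.9 to 0.99 multiplies 1/(1 - mu) by 10,
# so alpha should shrink by the same factor of 10.
alpha_new = compensated_alpha(0.01, 0.9, 0.99)
print(alpha_new)  # ~0.001
```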
Upvotes: 3
Reputation: 77857
Speaking to your second point, you generally want the velocity tuned to a value compatible with your problem. The velocity describes the movement of your estimated solution point. If the velocity is too small, you converge too slowly, and/or overfit; if it's too large, you can thrash around the solution point, and even fail to converge.
Most algorithms will have controls for this second problem, often simply reducing α by a small factor (such as .01) whenever we find a new best-ever loss. The part you need to control is your initial setting. If you increase μ such that 1/(1-μ) goes up by a factor of 1.25, try reducing α by 20% to compensate.
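The arithmetic behind that last suggestion can be verified directly. In this sketch the concrete μ values are my own choice, picked so that 1/(1-μ) grows by exactly 1.25:

```python
# If 1/(1 - mu) grows by a factor of 1.25, scaling alpha by 0.8
# (a 20% reduction) restores the original effective learning rate.
mu_old, mu_new = 0.9, 0.92            # 1/(1-mu) goes from 10 to 12.5
factor = (1 - mu_old) / (1 - mu_new)  # growth of 1/(1 - mu)
print(factor)                          # ~1.25

alpha_old = 0.01
alpha_new = alpha_old * 0.8            # reduce alpha by 20%
eff_old = alpha_old / (1 - mu_old)
eff_new = alpha_new / (1 - mu_new)
print(eff_old, eff_new)                # the two effective rates match
```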
Upvotes: 2