Dr. Prasanna Date

Reputation: 775

Hessian-Free Optimization versus Gradient Descent for DNN training

How do Hessian-Free (HF) optimization techniques compare against gradient descent techniques (e.g., Stochastic Gradient Descent (SGD), Batch Gradient Descent, Adaptive Gradient Descent) for training Deep Neural Networks (DNNs)?

Under what circumstances should one prefer HF techniques as opposed to Gradient Descent techniques?

Upvotes: 2

Views: 1646

Answers (2)

mehdi

Reputation: 139

I think that knowing the difference helps one decide when and where to use each method. I will try to shed some light on the concepts.

Gradient descent is a first-order optimization method and has long been used for training neural networks, since second-order methods, such as Newton's method, are computationally infeasible. However, second-order methods show much better convergence characteristics than first-order methods, because they also take the curvature of the error surface into account. A minimal sketch contrasting the two update rules is given below.
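This is my own toy illustration, not code from any particular framework: one step of gradient descent versus one Newton step on a quadratic loss, where the curvature matrix A plays the role of the Hessian and the names A, b, and lr are illustrative assumptions.

import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T A w - b^T w
# gradient: A w - b    Hessian: A
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
w = np.zeros(2)                      # current parameters

grad = A @ w - b                     # first-order information only

# Gradient descent: step along the negative gradient, scaled by a hand-tuned learning rate.
lr = 0.1
w_gd = w - lr * grad

# Newton's method: rescale the step by the inverse Hessian, so curvature is taken into account.
w_newton = w - np.linalg.solve(A, grad)

print("gradient descent step:", w_gd)
print("Newton step (minimizer of the quadratic):", w_newton)

On this quadratic, a single Newton step lands exactly at the minimum, while gradient descent needs many small steps whose size depends on the hand-tuned learning rate.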

Additionally, first-order methods require a lot of tuning of the learning-rate (step-size) parameter, which is application specific. They also tend to get trapped in local optima and exhibit slow convergence.

The reason for the infeasibility of Newton's method is the computation of the Hessian matrix, which takes prohibitively long. To overcome this issue, "Hessian-free" learning was proposed, in which Newton's method is applied without ever computing the Hessian matrix explicitly; only Hessian-vector products are needed, as sketched below.
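Here is a rough sketch of that idea (my own illustration on a toy quadratic loss; the helper names grad_fn, hvp, and cg_solve are made up): the Newton system H p = -g is solved by conjugate gradient using only Hessian-vector products, here approximated by a finite difference of gradients, so the Hessian is never formed or inverted.

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def loss(w):
    return 0.5 * w @ A @ w - b @ w

def grad_fn(w):
    return A @ w - b

def hvp(w, v, eps=1e-5):
    # Hessian-vector product H v, approximated by a finite difference of gradients.
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def cg_solve(w, g, iters=10, tol=1e-8):
    # Conjugate gradient for H p = -g, using only hvp(w, .) calls.
    p = np.zeros_like(g)
    r = -g - hvp(w, p)               # residual of the Newton system
    d = r.copy()
    for _ in range(iters):
        Hd = hvp(w, d)
        alpha = (r @ r) / (d @ Hd)
        p += alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

w = np.zeros(2)
step = cg_solve(w, grad_fn(w))       # approximate Newton step, no Hessian matrix built
w = w + step
print("loss after one Hessian-free step:", loss(w))

In a real network the Hessian-vector product would come from automatic differentiation (or a Gauss-Newton approximation) rather than a finite difference, but the structure is the same: an inner conjugate-gradient loop driven only by matrix-vector products.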

I don't want to go into more detail, but as far as I know, for deep networks it is highly recommended either to use HF optimization (there are many improvements over the basic HF approach as well), since it takes much less time for training, or to use SGD with momentum (sketched below).
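For completeness, the SGD-with-momentum update looks roughly like this (a sketch; the gradient and hyperparameter values are placeholders, and in practice the gradient comes from a mini-batch):

import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    # Classical momentum: accumulate a velocity, then step along it.
    v = momentum * v - lr * grad
    return w + v, v

w = np.zeros(2)
v = np.zeros(2)
grad = np.array([0.5, -1.0])         # placeholder mini-batch gradient
w, v = sgd_momentum_step(w, v, grad)
print(w, v)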

Upvotes: 4

runDOSrun

Reputation: 10995

In short, HFO is a way to avoid the vanishing gradient problem that comes from (naively) using backpropagation in deep nets. However, Deep Learning is about avoiding this issue by tweaking the learning algorithm and/or the architecture, so in the end it comes down to specific comparisons between each particular network model (and strategy, like pre-training) and HFO. There are a lot of recent studies on this topic, but it is not fully explored yet. In some cases HFO performs better, in some it doesn't. Afaik (might be outdated soon) Elman-based RNNs (not LSTMs) benefit from it the most.

Tl;dr: SGD is still the go-to method, although flawed, until someone finds a better non-SGD way of learning. HFO is one suggestion among many, but it has not been established as state-of-the-art yet.

Upvotes: 2
