gphilip

Reputation: 706

MLPRegressor learning_rate_init for lbfgs solver in sklearn

For a school project I need to evaluate a neural network with different learning rates. I chose sklearn to implement the neural network (using the MLPRegressor class). Since the training data is quite small (20 instances, each with 2 inputs and 1 output), I decided to use the lbfgs solver; stochastic solvers like sgd and adam don't make much sense for data of this size.

The project mandates testing the neural network with different learning rates. That, however, is not possible with the lbfgs solver according to the documentation:

learning_rate_init : double, default=0.001. The initial learning rate used. It controls the step-size in updating the weights. Only used when solver='sgd' or 'adam'.

Is there a way I can somehow access and modify the learning rate of the lbfgs solver, or does that question not even make sense?

Upvotes: 1

Views: 4248

Answers (2)

A Co

Reputation: 998

LBFGS is an optimization algorithm that simply does not use a learning rate. For the purpose of your school project, you should use either sgd or adam. Regarding whether it makes more sense or not, I would say that training a neural network on 20 data points doesn't make a lot of sense anyway, except for learning the basics.
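For the project's purposes, switching to a stochastic solver lets you vary the learning rate directly. A minimal sketch (the toy data and target here are made up for illustration, matching only the shapes from the question):

```python
# Sketch: evaluating MLPRegressor with different learning rates,
# using a stochastic solver where learning_rate_init actually applies.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 2)           # 20 instances, 2 inputs (as in the question)
y = X[:, 0] + 2 * X[:, 1]     # toy target: 1 output

for lr in (0.001, 0.01, 0.1):
    model = MLPRegressor(solver="adam", learning_rate_init=lr,
                         hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=0)
    model.fit(X, y)
    print(lr, model.loss_)
```

Comparing `model.loss_` (or a held-out score) across the loop then gives the learning-rate evaluation the project asks for.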

LBFGS is a quasi-Newton optimization method. It is based on the hypothesis that the function you seek to optimize can be approximated locally by a second-order Taylor expansion. It roughly proceeds like this:

  • Start from an initial guess
  • Use the Jacobian matrix to compute the direction of steepest descent
  • Use the Hessian matrix to compute the descent step and reach the next point
  • Repeat until convergence
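The steps above can be sketched in one dimension, where the Jacobian and Hessian reduce to the first and second derivatives. On the toy quadratic f(w) = (w - 3)², the Newton step computes its own step size from the curvature, while gradient descent has to be given a learning rate:

```python
# Newton step vs. gradient descent step on f(w) = (w - 3)**2.

def grad(w):      # first derivative (the "Jacobian" in 1-D)
    return 2 * (w - 3)

def hess(w):      # second derivative (the "Hessian" in 1-D)
    return 2.0

w = 0.0                             # initial guess

# Newton: the Hessian supplies the step size, no learning rate needed.
# For a quadratic, one step lands exactly on the optimum.
w_newton = w - grad(w) / hess(w)    # -> 3.0

# Gradient descent: must pick a learning rate for the step size.
lr = 0.1
w_gd = w - lr * grad(w)             # -> 0.6, still far from 3.0
print(w_newton, w_gd)
```

This is why the method's step size is derived from the problem itself rather than being a tunable hyperparameter.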

The difference with Newton methods is that quasi-Newton methods use approximations of the Jacobian and/or Hessian matrices.

Newton and quasi-Newton methods require more smoothness from the function being optimized than gradient descent does, but they converge faster. Indeed, computing the descent step with the Hessian matrix is more efficient because it can anticipate the distance to the local optimum, so the iterates neither oscillate around it nor converge very slowly. Gradient descent, on the other hand, only uses the Jacobian matrix (first-order derivatives) to compute the direction of steepest descent, and uses the learning rate as the descent step.

In practice, gradient descent is used in deep learning because computing the Hessian matrix would be too expensive.

So it makes no sense to talk about a learning rate for Newton (or quasi-Newton) methods; it is simply not applicable.

Upvotes: 5

Bernardo stearns reisen

Reputation: 2657

Not a complete answer, but hopefully a good pointer.

The sklearn.neural_network.MLPRegressor is implemented in the _multilayer_perceptron module on GitHub.

By inspecting the module I noticed that, unlike for the other solvers, scikit-learn implements the lbfgs algorithm in the base class itself, so you can adapt it fairly easily.

It seems that they don't use any learning rate, so you could adapt this code and multiply the loss by the learning rate you want to test. I'm just not totally sure whether adding a learning rate makes sense in the context of lbfgs.

I believe the loss is being used here:

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))

The code is located at line 430 of the _multilayer_perceptron.py module:

def _fit_lbfgs(self, X, y, activations, deltas, coef_grads,
               intercept_grads, layer_units):
    # Store meta information for the parameters
    self._coef_indptr = []
    self._intercept_indptr = []
    start = 0

    # Save sizes and indices of coefficients for faster unpacking
    for i in range(self.n_layers_ - 1):
        n_fan_in, n_fan_out = layer_units[i], layer_units[i + 1]

        end = start + (n_fan_in * n_fan_out)
        self._coef_indptr.append((start, end, (n_fan_in, n_fan_out)))
        start = end

    # Save sizes and indices of intercepts for faster unpacking
    for i in range(self.n_layers_ - 1):
        end = start + layer_units[i + 1]
        self._intercept_indptr.append((start, end))
        start = end

    # Run LBFGS
    packed_coef_inter = _pack(self.coefs_,
                              self.intercepts_)

    if self.verbose is True or self.verbose >= 1:
        iprint = 1
    else:
        iprint = -1

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))
    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
    self.loss_ = opt_res.fun
    self._unpack(opt_res.x)
Upvotes: 0
