gphilip

Reputation: 706

MLPRegressor learning_rate_init for lbfgs solver in sklearn

For a school project I need to evaluate a neural network with different learning rates. I chose sklearn to implement the neural network (using the MLPRegressor class). Since the training data is quite small (20 instances, each with 2 inputs and 1 output), I decided to use the lbfgs solver; stochastic solvers like sgd and adam don't make much sense for data of this size.

The project mandates testing the neural network with different learning rates. That, however, is not possible with the lbfgs solver according to the documentation:

learning_rate_init : double, default=0.001. The initial learning rate used. It controls the step-size in updating the weights. Only used when solver='sgd' or 'adam'.

Is there a way I can somehow access and modify the learning rate of the lbfgs solver, or does that question not even make sense?

Upvotes: 1

Views: 4248

Answers (2)

A Co

Reputation: 998

LBFGS is an optimization algorithm that simply does not use a learning rate. For the purpose of your school project, you should use either sgd or adam. Regarding whether it makes more sense or not, I would say that training a neural network on 20 data points doesn't make a lot of sense anyway, except for learning the basics.
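For the project's purposes, switching to a stochastic solver lets you vary the learning rate directly. A minimal sketch (the toy data and target here are made up for illustration, matching only the shapes from the question):

```python
# Sketch: evaluating MLPRegressor with different learning rates,
# using a stochastic solver where learning_rate_init actually applies.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 2)           # 20 instances, 2 inputs (as in the question)
y = X[:, 0] + 2 * X[:, 1]     # toy target: 1 output

for lr in (0.001, 0.01, 0.1):
    model = MLPRegressor(solver="adam", learning_rate_init=lr,
                         hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=0)
    model.fit(X, y)
    print(lr, model.loss_)
```

Comparing `model.loss_` (or a held-out score) across the loop then gives the learning-rate evaluation the project asks for.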

LBFGS is a quasi-Newton optimization method. It is based on the hypothesis that the function you seek to optimize can be approximated locally by a second-order Taylor expansion. It roughly proceeds like this:

  • Start from an initial guess
  • Use the Jacobian matrix to compute the direction of steepest descent
  • Use the Hessian matrix to compute the descent step and reach the next point
  • Repeat until convergence
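The steps above can be sketched in one dimension, where the Jacobian and Hessian reduce to the first and second derivatives. On the toy quadratic f(w) = (w - 3)², the Newton step computes its own step size from the curvature, while gradient descent has to be given a learning rate:

```python
# Newton step vs. gradient descent step on f(w) = (w - 3)**2.

def grad(w):      # first derivative (the "Jacobian" in 1-D)
    return 2 * (w - 3)

def hess(w):      # second derivative (the "Hessian" in 1-D)
    return 2.0

w = 0.0                             # initial guess

# Newton: the Hessian supplies the step size, no learning rate needed.
# For a quadratic, one step lands exactly on the optimum.
w_newton = w - grad(w) / hess(w)    # -> 3.0

# Gradient descent: must pick a learning rate for the step size.
lr = 0.1
w_gd = w - lr * grad(w)             # -> 0.6, still far from 3.0
print(w_newton, w_gd)
```

This is why the method's step size is derived from the problem itself rather than being a tunable hyperparameter.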

The difference with Newton methods is that quasi-Newton methods use approximations of the Jacobian and/or Hessian matrices.

Newton and quasi-Newton methods require more smoothness from the function being optimized than gradient descent does, but they converge faster. Indeed, computing the descent step with the Hessian matrix is more efficient because it can anticipate the distance to the local optimum, so the iterates neither oscillate around it nor converge very slowly. Gradient descent, on the other hand, only uses the Jacobian matrix (first-order derivatives) to compute the direction of steepest descent, and uses the learning rate as the descent step.

In practice, gradient descent is used in deep learning because computing the Hessian matrix would be too expensive.

So it makes no sense to talk about a learning rate for Newton (or quasi-Newton) methods; it is simply not applicable.

Upvotes: 5

Bernardo stearns reisen

Reputation: 2657

Not a complete answer, but hopefully a good pointer.

The sklearn.neural_network.MLPRegressor is implemented in the _multilayer_perceptron module on GitHub.

By inspecting the module I noticed that, unlike for the other solvers, scikit-learn implements the lbfgs algorithm in the base class itself, so you can adapt it fairly easily.

It seems that they don't use any learning rate, so you could adapt this code and multiply the loss by the learning rate you want to test. I'm just not totally sure whether adding a learning rate makes sense in the context of lbfgs.

I believe the loss is being used here:

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))

The code is located at line 430 of the _multilayer_perceptron.py module:

def _fit_lbfgs(self, X, y, activations, deltas, coef_grads,
               intercept_grads, layer_units):
    # Store meta information for the parameters
    self._coef_indptr = []
    self._intercept_indptr = []
    start = 0

    # Save sizes and indices of coefficients for faster unpacking
    for i in range(self.n_layers_ - 1):
        n_fan_in, n_fan_out = layer_units[i], layer_units[i + 1]

        end = start + (n_fan_in * n_fan_out)
        self._coef_indptr.append((start, end, (n_fan_in, n_fan_out)))
        start = end

    # Save sizes and indices of intercepts for faster unpacking
    for i in range(self.n_layers_ - 1):
        end = start + layer_units[i + 1]
        self._intercept_indptr.append((start, end))
        start = end

    # Run LBFGS
    packed_coef_inter = _pack(self.coefs_,
                              self.intercepts_)

    if self.verbose is True or self.verbose >= 1:
        iprint = 1
    else:
        iprint = -1

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))
    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
    self.loss_ = opt_res.fun
    self._unpack(opt_res.x)
Upvotes: 0
