Reputation: 706
For a school project I need to evaluate a neural network with different learning rates. I chose sklearn to implement the neural network (using the MLPRegressor class). Since the training data is pretty small (20 instances, 2 inputs and 1 output each) I decided to use the lbfgs solver, because stochastic solvers like sgd and adam don't make sense for this amount of data.
The project mandates testing the neural network with different learning rates. That, however, is not possible with the lbfgs solver, according to the documentation:
learning_rate_init : double, default=0.001
    The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.
Is there a way I can somehow access and modify the learning rate of the lbfgs solver, or does that question not even make sense?
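For reference, here is a minimal sketch of my setup (the data is made up, just to show the shape of the problem):

import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(20, 2)   # 20 instances, 2 inputs each
y = np.random.rand(20)      # 1 output each

# learning_rate_init is accepted but, per the docs, ignored when solver="lbfgs"
model = MLPRegressor(solver="lbfgs", learning_rate_init=0.01, max_iter=1000)
model.fit(X, y)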
Upvotes: 1
Views: 4248
Reputation: 998
LBFGS is an optimization algorithm that simply does not use a learning rate. For the purpose of your school project, you should use either sgd or adam. Regarding whether it makes more sense or not, I would say that training a neural network on 20 data points doesn't make a lot of sense anyway, except for learning the basics.
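For example, a minimal sketch (with made-up data) of comparing a few learning rates with a stochastic solver could look like this:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(20, 2)                 # 20 instances, 2 inputs
y = X[:, 0] + 2 * X[:, 1]           # 1 output per instance

for lr in (0.0001, 0.001, 0.01, 0.1):
    model = MLPRegressor(solver="adam", learning_rate_init=lr,
                         max_iter=5000, random_state=0)
    model.fit(X, y)
    print(f"learning_rate_init={lr}: final loss = {model.loss_:.5f}")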
LBFGS is a quasi-Newton optimization method. It is based on the hypothesis that the function you seek to optimize can be approximated locally by a second-order Taylor expansion. It roughly proceeds like this:
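In standard notation (mine, not part of the original answer), each iteration minimizes a local quadratic model of the objective f around the current point x_k and takes a Newton-style step:

f(x_k + p) \approx f(x_k) + \nabla f(x_k)^\top p + \tfrac{1}{2} p^\top H_k p

x_{k+1} = x_k - H_k^{-1} \nabla f(x_k)

where H_k is the (approximate) Hessian of f at x_k.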
The difference with Newton's method is that quasi-Newton methods use approximations of the Jacobian and/or Hessian matrices.
Newton and quasi-Newton methods require more smoothness from the function being optimized than gradient descent does, but they converge faster. Indeed, computing the descent step with the Hessian matrix is more efficient because the curvature tells the method how far away the local optimum is, so it neither oscillates around it nor converges very slowly. Gradient descent, on the other hand, only uses the Jacobian matrix (first-order derivatives) to compute the direction of steepest descent and uses the learning rate as the step size.
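Written out in the same notation, the gradient descent update is simply

x_{k+1} = x_k - \eta \nabla f(x_k)

where the learning rate \eta plays the role that the inverse Hessian H_k^{-1} plays in the Newton step above.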
In practice, gradient descent is used in deep learning because computing the Hessian matrix would be too expensive.
So it makes no sense to talk about a learning rate for Newton (or quasi-Newton) methods; it is simply not applicable.
Upvotes: 5
Reputation: 2657
Not a complete answer, but hopefully a good pointer.
The sklearn.neural_network.MLPRegressor is implemented in the _multilayer_perceptron module on GitHub.
By inspecting the module I noticed that, unlike the other solvers, scikit-learn runs the lbfgs path in the base class itself, so you can easily adapt it.
It seems that no learning rate is used there, so you could adapt this code and multiply the loss by whatever learning rate you want to test (see the sketch after the full function below). I'm just not totally sure whether adding a learning rate makes sense in the context of lbfgs.
I believe the loss is being used here:
opt_res = scipy.optimize.minimize(
    self._loss_grad_lbfgs, packed_coef_inter,
    method="L-BFGS-B", jac=True,
    options={
        "maxfun": self.max_fun,
        "maxiter": self.max_iter,
        "iprint": iprint,
        "gtol": self.tol
    },
    args=(X, y, activations, deltas, coef_grads, intercept_grads))
The code is located at line 430 of the _multilayer_perceptron.py module:
def _fit_lbfgs(self, X, y, activations, deltas, coef_grads,
               intercept_grads, layer_units):
    # Store meta information for the parameters
    self._coef_indptr = []
    self._intercept_indptr = []
    start = 0

    # Save sizes and indices of coefficients for faster unpacking
    for i in range(self.n_layers_ - 1):
        n_fan_in, n_fan_out = layer_units[i], layer_units[i + 1]

        end = start + (n_fan_in * n_fan_out)
        self._coef_indptr.append((start, end, (n_fan_in, n_fan_out)))
        start = end

    # Save sizes and indices of intercepts for faster unpacking
    for i in range(self.n_layers_ - 1):
        end = start + layer_units[i + 1]
        self._intercept_indptr.append((start, end))
        start = end

    # Run LBFGS
    packed_coef_inter = _pack(self.coefs_,
                              self.intercepts_)

    if self.verbose is True or self.verbose >= 1:
        iprint = 1
    else:
        iprint = -1

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))

    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
    self.loss_ = opt_res.fun
    self._unpack(opt_res.x)
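Purely as an illustration of the idea above, here is a hypothetical sketch that subclasses MLPRegressor and rescales the loss and gradient handed to L-BFGS. It relies on the private _loss_grad_lbfgs method visible in the code above, which may change between scikit-learn versions, and, as the other answer points out, it is doubtful that this really behaves like a learning rate:

import numpy as np
from sklearn.neural_network import MLPRegressor

class ScaledLossMLPRegressor(MLPRegressor):
    # Hypothetical scaling factor; set on the instance after construction
    # so scikit-learn's get_params/clone machinery is left untouched.
    lr_scale = 1.0

    def _loss_grad_lbfgs(self, packed_coef_inter, X, y, activations,
                         deltas, coef_grads, intercept_grads):
        # Reuse the parent's loss/gradient computation, then rescale both
        # before they are handed to scipy.optimize.minimize.
        loss, grad = super()._loss_grad_lbfgs(
            packed_coef_inter, X, y, activations, deltas,
            coef_grads, intercept_grads)
        return self.lr_scale * loss, self.lr_scale * grad

X = np.random.rand(20, 2)
y = np.random.rand(20)

model = ScaledLossMLPRegressor(solver="lbfgs", max_iter=1000)
model.lr_scale = 0.1   # the "learning rate" to test
model.fit(X, y)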
Upvotes: 0