Reputation: 4093
I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01)
to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
Upon looking at the source code for the regularization, it suggests that Keras simply adds the L1 norm of the parameters to the loss function.
This would be incorrect, because the parameters would then almost certainly never reach exactly zero (to within floating-point error), as intended with L1 regularization. The L1 norm is not differentiable at zero, so subgradient methods need to be used, in which the optimization routine sets a parameter to zero once it is close enough to zero. See the soft-thresholding operator max(0, ..) here.
Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?
EDIT: Also, here is a superb blog post explaining the soft-thresholding operator for L1 regularization.
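For reference, here is a minimal sketch of the soft-thresholding operator I mean (the names step and lam for the step size and L1 strength are just illustrative):

    import numpy as np

    def soft_threshold(w, step, lam):
        # proximal operator of the L1 norm: sign(w) * max(|w| - step * lam, 0)
        return np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)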
Upvotes: 14
Views: 6991
Reputation: 1040
TL;DR: The formulation in deep learning frameworks is correct, but currently we don't have a powerful solver/optimizer to solve it EXACTLY with SGD or its variants. However, if you use proximal optimizers, you can obtain a sparse solution.
Your observation is right.
Subgradient descent has very poor convergence properties for non-smooth functions, such as the Lasso objective, since it ignores problem structure completely (it doesn't distinguish between the least squares fit and the regularization term) by just looking at subgradients of the entire objective. Intuitively, taking small steps in the direction of the (sub)gradient usually won't lead to coordinates equal to zero exactly.
Proximal optimizers in TensorFlow (e.g. tf.train.ProximalAdagradOptimizer) can lead to sparse solutions, so you may give them a try. Another simple workaround is to zero out small weights (e.g. absolute value < 1e-4) after training, or after each gradient-descent step, to force sparsity; a sketch is shown below. This is just a handy heuristic and not theoretically rigorous.
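A minimal sketch of that post-hoc thresholding heuristic (the helper name and the 1e-4 cutoff are illustrative, not part of Keras):

    import numpy as np

    def zero_small_weights(model, threshold=1e-4):
        # Heuristic pruning: set every weight with |w| < threshold to exactly 0.
        for layer in model.layers:
            pruned = [np.where(np.abs(w) < threshold, 0.0, w) for w in layer.get_weights()]
            layer.set_weights(pruned)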
Upvotes: 2
Reputation: 1
Keras implements L1 regularization properly, but this is not a LASSO. For the LASSO one would need a soft-thresholding function, as correctly pointed out in the original post. It would be very useful to have a function similar to keras.layers.ThresholdedReLU(theta=1.0), but with f(x) = x for x > theta or x < -theta, and f(x) = 0 otherwise. For the LASSO, theta would be equal to the learning rate times the regularization factor of the L1 function.
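A minimal sketch of such a two-sided threshold as a custom Keras layer (the class name is illustrative; this layer does not exist in Keras):

    import tensorflow as tf
    from tensorflow import keras

    class TwoSidedThreshold(keras.layers.Layer):
        # f(x) = x for x > theta or x < -theta, and f(x) = 0 otherwise
        def __init__(self, theta=1.0, **kwargs):
            super().__init__(**kwargs)
            self.theta = theta

        def call(self, inputs):
            return tf.where(tf.abs(inputs) > self.theta, inputs, tf.zeros_like(inputs))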
Upvotes: 0
Reputation: 40516
So despite @Joshua's answer, there are three other things worth mentioning:

1. There is no problem with the gradient at 0: keras automatically sets it to 1, similarly to the relu case.
2. Remember that values smaller than 1e-6 are actually equal to 0, as this is float32 precision.
3. The problem of most values not being set to 0 can arise for computational reasons, due to the nature of a gradient-descent based algorithm (and a high l1 value), because of oscillations caused by the gradient discontinuity. To understand this, imagine that for a given weight w = 0.005 your learning rate is equal to 0.01 and the gradient of the main loss is equal to 0 w.r.t. w. Then your weight would be updated in the following manner:
w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1 as w > 0),
and after the second update:
w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1 as w < 0).
As you can see, the absolute value of w hasn't decreased even though you applied l1 regularization, and this happened due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can experience such oscillating behavior really often when using an l1 norm regularizer.
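A small numeric sketch of this oscillation (the values match the example above; this is illustrative, not what Keras runs internally):

    import numpy as np

    w, lr = 0.005, 0.01            # weight and learning rate from the example
    for step in range(4):
        # the main-loss gradient is 0, so only the l1 subgradient sign(w) acts
        w = w - lr * np.sign(w)
        print(step, w)
    # w flips between roughly +0.005 and -0.005; its absolute value never shrinks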
Upvotes: 6
Reputation: 2479
Keras correctly implements L1 regularization. In the context of neural networks, L1 regularization simply adds the L1 norm of the parameters to the loss function (see CS231).
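A minimal sketch of what that penalty looks like for a single Dense layer (the 0.01 factor mirrors the question; layer.losses is how Keras exposes the added term):

    import tensorflow as tf
    from tensorflow import keras

    layer = keras.layers.Dense(4, kernel_regularizer=keras.regularizers.l1(0.01))
    layer.build(input_shape=(None, 3))

    # the regularization term added to the loss is simply 0.01 * sum(|w|)
    manual_penalty = 0.01 * tf.reduce_sum(tf.abs(layer.kernel))
    keras_penalty = tf.add_n(layer.losses)      # same value, collected by Keras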
While L1 regularization does encourage sparsity, it does not guarantee that the output will be sparse. The parameter updates from stochastic gradient descent are inherently noisy, so the probability that any given parameter is exactly 0 is vanishingly small.
However, many of the parameters of an L1-regularized network are often close to 0. A rudimentary approach would be to threshold small values to 0. There has been research exploring more advanced methods of generating sparse neural networks. In this paper, the authors simultaneously prune and train a neural network to achieve 90-95% sparsity on a number of well-known network architectures.
Upvotes: 2