Reputation: 21
Well, in machine learning, one way to prevent overfitting is to add L2 regularization, and some say that L1 regularization is better. Why is that? Also, I know that L1 is used to ensure sparsity (of the learned weights); what is the theoretical support for this result?
Upvotes: 1
Views: 3163
Reputation: 1
L1 regularization: It adds an L1 penalty equal to the sum of the absolute values of the coefficients, i.e. it restricts the size of the coefficients. Lasso regression is an example of a model that implements this penalty.
L1 regularization is the preferred choice when there is a high number of features, as it provides sparse solutions. There is also a computational advantage, because features with zero coefficients can be skipped entirely.
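As a quick illustration, here is a minimal sketch using scikit-learn on synthetic data (the dataset shape and alpha value are arbitrary choices, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 features, only 5 of which are actually informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 drives most coefficients exactly to zero; L2 only shrinks them
print("Lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
```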
You can read more on this website.
Upvotes: 0
Reputation: 1058
Because the L1 regularizer creates sparsity, it drives weights to zero faster than the L2 regularizer does. Let's try to see why. The L1 regularizer is the absolute value function |w_i|, which is piecewise linear: whether the weight is positive or negative, the penalty is always positive. When optimizing a model we typically use SGD (stochastic gradient descent), which requires a differentiable function, so we need the derivative of the L1 regularizer. The derivative of |w_i| is sign(w_i), which has constant magnitude, so during the SGD update phase the penalty pushes each weight toward zero in fixed-size steps, no matter how small the weight already is.
Coming to the L2 regularizer, |w_i|^2 is a quadratic function whose graph is a parabola with its minimum at 0 and no maximum. The derivative of the L2 regularizer is the linear function 2*w_i, which shrinks as w_i approaches zero: the updates become smaller and smaller, so under L2 a weight decays toward zero but never actually reaches it. Under L1, by contrast, the constant-magnitude gradient pushes a weight all the way to exactly zero in a finite number of steps, so it reaches zero faster than under L2. This answers your theoretical-support question.
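As a toy illustration, here is a minimal sketch in plain NumPy that applies gradient steps to each penalty term alone (no data term; the starting weight and learning rate are arbitrary, with the learning rate chosen so the L1 steps land exactly on zero):

```python
import numpy as np

lr = 0.125        # learning rate (binary-exact so the L1 path hits 0 exactly)
w1 = w2 = 2.0     # same starting weight under each penalty

for _ in range(30):
    w1 -= lr * np.sign(w1)   # d|w|/dw = sign(w): fixed-size step toward 0
    w2 -= lr * 2.0 * w2      # d(w^2)/dw = 2w: step shrinks as w shrinks

print("L1-penalized weight:", w1)   # reaches exactly 0.0 and stays there
print("L2-penalized weight:", w2)   # ~3.6e-4: small, but never exactly 0
```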
Coming to usage, it depends on your problem: if the data has lots of features and you know most of them are useless, it is better to use the L1 regularizer, because it will set the coefficients of those features to 0 and you will get a sparse feature vector that is easy to interpret. This is one of the use cases of L1. There are also situations where you may want to use both penalties together; that combined regularization is termed elastic net. You need to experiment with these options to get the best results for your model.
Hope this helps.
Upvotes: -1
Reputation: 18214
It is well known that L1 regularization helps perform feature selection in sparse feature spaces, and that is a good practical reason to use L1 in some situations. However, beyond that particular reason I have never seen L1 perform better than L2 in practice. If you take a look at the LIBLINEAR FAQ on this issue you will see how they have not seen a practical example where L1 beats L2, and they encourage users of the library to contact them if they find one. Even in a situation where you might benefit from L1's sparsity in order to do feature selection, using L2 on the remaining variables is likely to give better results than L1 by itself.
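A hedged sketch of that two-stage idea with scikit-learn (synthetic data; the estimators, C value, and SelectFromModel's default threshold are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# Stage 1: an L1-penalized model selects features (zero coefficients are dropped)
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))

# Stage 2: an L2-penalized model is fit on the surviving features only
clf = make_pipeline(selector, LogisticRegression(penalty="l2", max_iter=1000))
clf.fit(X, y)
```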
So, as Andrew Ng explains here:
When the number of features is quite large you can give L1 a shot, but L2 should always be your blind-eye pick.
Even in the case when you have a strong reason to use L1 given the number of features, I would recommend going for Elastic Nets instead. Admittedly this will only be a practical option if you are doing linear/logistic regression. But, in that case, Elastic Nets have proved to be (in theory and in practice) better than L1/Lasso. Elastic Nets combine L1 and L2 regularization at the "only" cost of introducing another hyperparameter to tune (see Hastie's paper on stanford.edu for more details).
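For illustration, a minimal Elastic Net sketch in scikit-learn (the extra hyperparameter is l1_ratio, which mixes the two penalties; the candidate values and synthetic data here are arbitrary and would normally be tuned by cross-validation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       noise=1.0, random_state=0)

# ElasticNetCV tunes alpha by cross-validation for each candidate l1_ratio;
# l1_ratio=1.0 is pure L1 (Lasso), l1_ratio close to 0 approaches pure L2 (Ridge)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("chosen l1_ratio:", model.l1_ratio_, "chosen alpha:", model.alpha_)
```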
So, in short: L1 regularization works best for feature selection in sparse feature spaces.
Upvotes: 1
Reputation: 1145
L1 regularization is used for sparsity. This can be beneficial, especially if you are dealing with big data, as L1 can generate more compressed models than L2 regularization. This is basically because, as the regularization parameter increases, there is a bigger chance that the optimum is exactly at 0.
L2 regularization punishes big numbers more, due to the squaring. Of course, L2 is more 'elegant' in terms of smoothness.
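In the one-dimensional case this can be made precise. Minimizing a squared-error term plus each penalty (a standard derivation; z denotes the unregularized least-squares solution and lambda the regularization strength):

```latex
\text{Ridge: } \min_w \tfrac{1}{2}(w - z)^2 + \tfrac{\lambda}{2} w^2
  \;\Rightarrow\; w^* = \frac{z}{1+\lambda}
  \quad (\text{zero only if } z = 0)

\text{Lasso: } \min_w \tfrac{1}{2}(w - z)^2 + \lambda |w|
  \;\Rightarrow\; w^* = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0)
  \quad (\text{exactly zero whenever } |z| \le \lambda)
```

So as lambda grows, the whole interval of z values mapped to exactly zero grows with it, which is precisely the "bigger chance your optimum is at 0" above.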
You should check this webpage.
P.S.
A more mathematically comprehensive explanation may not be a good fit for this website; you can try other Stack Exchange sites instead.
Upvotes: 6