toobee

Reputation: 2752

Why is l2 regularization always an addition?

I am reading up on L2 regularization of neural network weights. As far as I understand, the intention is to push weights towards zero: large weights receive a high penalty, while smaller ones are punished less severely.

The formula is usually something like:

new_weight = weight * update + lambda * sum(squared(weights))

My question: why is this term always positive? If the weight is already positive, the L2 term will never decrease it; it only makes things worse and pushes the weight further away from zero. This is the case in almost all formulas I have seen so far. Why is that?

Upvotes: 0

Views: 1522

Answers (1)

DocDriven

Reputation: 3974

The formula you presented is very vague about what an 'update' is.

First, what is regularization? Generally speaking, the formula for L2 regularization is:

C = C_0 + (lambda / (2 * n)) * sum(w^2)

(C_0 is the original, unregularized cost, n is the training set size, lambda scales the influence of the L2 term)

You add an extra term to your original cost function, and this term is also differentiated when computing the weight update. Intuitively, this punishes big weights, so the algorithm tries to find the best trade-off between small weights and the original cost. Small weights are associated with a simpler model, as the behavior of the network does not change much when given some outlying values. This means it filters out the noise in the data and ends up learning the simplest possible solution. In other words, it reduces overfitting.
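As a rough illustration (my own sketch, not part of the original answer), this is how the regularized cost could be computed with NumPy; base_cost, weights, lam and n are just placeholder names:

# Minimal sketch of an L2-regularized cost, assuming base_cost is the
# unregularized cost C_0 and weights is a list of weight arrays.
import numpy as np

def l2_regularized_cost(base_cost, weights, lam, n):
    # lam: regularization strength lambda, n: training set size
    l2_term = (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return base_cost + l2_term

# Example with two small weight matrices and an arbitrary unregularized cost
weights = [np.array([[0.5, -1.2], [2.0, 0.1]]), np.array([[0.3], [-0.7]])]
print(l2_regularized_cost(base_cost=0.42, weights=weights, lam=0.1, n=1000))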

Coming to your question, let's derive the update rule. For any weight in the network, we get

dC/dw = dC_0/dw + (lambda / n) * w

Thus, the update formula for the weights can be written as (eta is the learning rate)

w -> w - eta * dC_0/dw - (eta * lambda / n) * w

w -> (1 - eta * lambda / n) * w - eta * dC_0/dw

Considering only the first term, the weight is indeed driven towards zero regardless of its sign, because it is multiplied by a factor (1 - eta * lambda / n) smaller than one. But the second term can increase the weight if the partial derivative is negative, or decrease it if the derivative is positive. All in all, weights can be positive or negative; you cannot derive a sign constraint from this expression, and the same holds for the derivatives. Think of fitting a line with a negative slope: the corresponding weight has to be negative. To answer your question: neither the derivative of the regularized cost nor the weights themselves have to be positive all the time.
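To make the sign behavior concrete, here is a minimal sketch of one update step (again my own illustration, with arbitrary names and values). With a zero gradient, both a positive and a negative weight are scaled towards zero:

# One gradient-descent step with the L2 term, under the assumptions above.
# The factor (1 - eta*lam/n) < 1 shrinks the weight towards zero whether it
# is positive or negative; the gradient term can push in either direction.
import numpy as np

def sgd_step_with_l2(w, grad_c0, eta, lam, n):
    # grad_c0: partial derivative of the unregularized cost w.r.t. w
    return (1 - eta * lam / n) * w - eta * grad_c0

w = np.array([2.0, -2.0])      # one positive and one negative weight
grad = np.array([0.0, 0.0])    # zero gradient isolates the decay effect
print(sgd_step_with_l2(w, grad, eta=0.5, lam=1.0, n=10))  # [1.9, -1.9], both closer to 0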

If you need more clarification, leave a comment.

Upvotes: 3
