Reputation: 1249
In a machine learning cost function, if we want to minimize the influence of two parameters, say theta3 and theta4, it seems we have to give them a large regularization parameter, as in the equation below.
I am not quite sure why a bigger regularization parameter reduces their influence instead of increasing it. How does this work?
Upvotes: 5
Views: 14173
Reputation: 11
Why does a bigger regularization parameter reduce the influence instead of increasing it?
The image you have shown is the cost function that you have to minimize, and in order to minimize it you have to make θ_3 and θ_4 close to zero.
I am sure many people understand that part, but I believe the real question people find confusing is: why do θ_3 and θ_4 approach 0 when the cost function is minimized?
I once thought this was a purely mathematical question, but it is really a consequence of the gradient descent algorithm.
Recall that the stopping criterion of the algorithm is when the change in the cost function value is less than a threshold, say 0.01. For the change in the cost function to be less than 0.01, the derivative of the cost function with respect to every θ must be small, correct? (That way all the θs stop changing between iterations, which means there is barely any change in the cost function with the new θs.) Now look at the derivatives with respect to θ_3 and θ_4:
derivative with respect to θ_3 = ..... + 2000·θ_3
derivative with respect to θ_4 = ..... + 2000·θ_4
They have that 2000·θ_3 or 2000·θ_4 at the end of their derivative terms, so θ_3 and θ_4 have to be extremely small for those derivatives to be small and for gradient descent to stop changing them by much; at the end you plug θ_3 and θ_4 back into the cost function to check whether its change is less than 0.01.
Thus, in order for the cost function to hit its minimum (meaning the change in the cost function value drops below the threshold), θ_3 and θ_4 must go toward 0.
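To make this concrete, here is a minimal gradient-descent sketch in Python. The toy data, the learning rate, and the penalty coefficients of 1000 (whose derivatives give the 2000·θ terms above) are my own assumptions, not from the question:

```python
import numpy as np

# Made-up toy data: 5 features, linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, -1.0, 3.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Penalty coefficients: 0 for theta_0..theta_2, a large 1000 for theta_3 and theta_4
penalty = np.array([0.0, 0.0, 0.0, 1000.0, 1000.0])

theta = np.zeros(5)
lr = 0.0005
prev_cost = np.inf
for _ in range(100000):
    residual = X @ theta - y
    cost = residual @ residual / (2 * len(y)) + penalty @ theta**2
    # Gradient: usual least-squares term plus 2·penalty·theta for the penalized parameters
    grad = X.T @ residual / len(y) + 2 * penalty * theta
    theta -= lr * grad
    if abs(prev_cost - cost) < 1e-9:  # stop when the cost barely changes, as described above
        break
    prev_cost = cost

print(theta)  # theta_3 and theta_4 end up very close to 0; the other thetas do not
```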
Upvotes: 0
Reputation: 1
I will try to put it in the simplest language. I think what you are asking is: how does adding a regularization term at the end decrease the values of parameters like theta3 and theta4 here?
So, let's first assume you add this term to the end of your loss function, which massively increases the loss and makes the model a bit more biased than before. Now we use some optimization method, say gradient descent, whose job is to find the values of all the thetas. Remember that up to this point we don't have any values for the thetas, and if you solve it you will realize that the resulting values of theta are different from what they would have been without the regularization term at the end. To be exact, they come out smaller for theta3 and theta4.
So this makes sure your hypothesis has more bias and less variance. In simple terms, it makes the fit a bit worse, or not as exact as before, but it generalizes better.
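As a rough illustration of the thetas coming out smaller once the penalty is added, here is a toy comparison (the data, the value of lambda, and penalizing every theta rather than just theta3 and theta4 are my own assumptions):

```python
import numpy as np

# Made-up toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([0.5, -1.0, 2.0, 1.5]) + rng.normal(scale=0.5, size=50)

lam = 10.0  # assumed regularization strength

# Plain least squares: theta = (X^T X)^-1 X^T y
theta_plain = np.linalg.solve(X.T @ X, X.T @ y)

# With an L2 penalty lam * sum(theta^2) added to the loss: theta = (X^T X + lam*I)^-1 X^T y
theta_reg = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

print(np.abs(theta_plain))  # larger magnitudes, lower bias
print(np.abs(theta_reg))    # shrunk toward 0: more bias, less variance
```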
Upvotes: 0
Reputation: 431
As the regularization parameter increases from 0 to infinity, the residual sum of squares on the training data increases, the variance of the model decreases, and the bias increases.
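A quick way to see this trend (the toy data and the lambda values are my own assumptions):

```python
import numpy as np

# Made-up toy data
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    # Closed-form L2-regularized least squares
    theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    rss = np.sum((y - X @ theta) ** 2)
    print(f"lambda={lam:7.1f}  training RSS={rss:9.3f}  |theta|={np.linalg.norm(theta):.3f}")
# The training RSS grows with lambda while the coefficients shrink (less variance, more bias).
```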
Upvotes: 0
Reputation: 2027
Quoting from a similar question's answer:
At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of a model is often measured by the size of the model w viewed as a vector. The overall loss function, as in your example above, consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called a penalty).

It is useful to think about what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error term, so the model is adjusted primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model improves and the model vector grows, the regularization term becomes a more significant part of the loss. Regularization prevents the model vector from growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple versus reducing training error.

There are different types of regularization terms in common use. The one you have, and the one most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization, which has the form λ·∑_i |w_i|, i.e. it penalizes the sum of the absolute values of the model parameters. It favours concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low-dimensional models, while lasso tends to work better for high-dimensional models like text classification, where it leads to sparse models, i.e. models with few non-zero parameters. There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So then you have 3 terms in your loss function: the error term and the 2 regularization terms, each with its own regularization parameter.
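For concreteness, the three penalty terms mentioned above could be computed like this (the model vector w and the two lambda values are made up purely for illustration):

```python
import numpy as np

w = np.array([0.5, -1.5, 3.0, 0.0])  # example model vector
lam1, lam2 = 0.1, 0.01               # assumed regularization parameters

error_term = 2.3                                   # placeholder for the data-fit error
l2_term = lam2 * np.sum(w ** 2)                    # L2 (ridge / SVM-style): spreads weight evenly
l1_term = lam1 * np.sum(np.abs(w))                 # L1 (lasso): favours sparse models
elastic_net_loss = error_term + l1_term + l2_term  # elastic net: error plus both penalties

print(l2_term, l1_term, elastic_net_loss)
```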
Upvotes: 3
Reputation: 2121
You said that you want to minimize the influence of two parameters, theta3 and theta4, meaning those two are both NOT important, so we are going to tell that to the model we want to fit by:

And here is the learning process of the model: given that theta3 and theta4 carry a really big parameter lambda, whenever theta3 or theta4 grows, your loss function grows heavily, because they (theta3 and theta4) both have a big multiplier (lambda). So to minimize your objective function (the loss function), theta3 and theta4 can only be given very small values, which says they are not important.
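For example, assuming the squared penalty from the question and an illustrative lambda = 1000, theta3 = 1 would add 1000·1² = 1000 to the loss, whereas theta3 = 0.01 adds only 1000·0.0001 = 0.1, so the optimizer strongly prefers the tiny value.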
Upvotes: 2
Reputation: 2821
It is because the optimal values of the thetas are found by minimizing the cost function.
As you increase the regularization parameter, the optimizer will have to choose smaller thetas in order to minimize the total cost.
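A one-dimensional illustration of this (my own, not from the answer): minimize (θ − 1)² + λ·θ². Setting the derivative 2(θ − 1) + 2λ·θ to zero gives θ = 1/(1 + λ), so the larger λ is, the smaller the optimal θ becomes, approaching 0 as λ → ∞.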
Upvotes: 7