Reputation: 21
I am trying to apply a simple optimization using gradient descent. In particular, I want to calculate the parameter vector (theta) that minimizes the cost function (mean squared error).
The gradient descent function looks like this:
import numpy as np

eta = 0.1  # learning rate
n_iterations = 1000
m = 100  # number of training samples
theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    # gradient of the MSE cost function with respect to theta
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
Where X_b and y are respectively the input matrix and the target vector.
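For reference, the gradient used in the loop follows from the MSE cost written in matrix form. This is the standard derivation, assuming X_b already contains the bias column of ones and m is the number of rows:

```latex
\mathrm{MSE}(\theta) = \frac{1}{m} \lVert X_b \theta - y \rVert^2
                     = \frac{1}{m} (X_b \theta - y)^\top (X_b \theta - y)

\nabla_\theta \, \mathrm{MSE}(\theta) = \frac{2}{m} X_b^\top (X_b \theta - y)
```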
Now, if I take a look at my final theta, it is always equal to [[nan], [nan]], while it should be equal to [[85.4575313 ], [ 0.11802224]] (obtained by using both np.linalg and scikit-learn's LinearRegression).
To get a numeric result, I have to reduce the learning rate to 0.00001 and the number of iterations to 500. With these changes, the results are still far from the real theta. My data, both X_b and y, are scaled using a StandardScaler.
If I print theta at each iteration, I get the following (these are only a few of the results):
...
[[2.09755838e+297]
[7.26731496e+299]]
[[-3.54990719e+300]
[-1.22992017e+303]]
[[6.00786188e+303]
[ inf]]
[[-inf]
[ nan]]
...
How can I solve this problem? Is it because of the domain of the function?
Thanks
Upvotes: 0
Views: 1029
Reputation: 21
I've found an error in the code. For the benefit of all readers, the error was caused by the feature scaling step, which isn't shown in the code above. The initial theta (randomly assigned) was on a completely different scale compared to the dataset, and this made it impossible to find valid parameters for the regression.
So, with correctly scaled inputs and targets, the function does its job and converges to the values that I know are correct, as reported in my question.
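A minimal sketch of what this looks like end to end, under some assumptions: the data below is synthetic (not the dataset from the question), and the feature is standardized by hand to zero mean and unit variance, which is what StandardScaler does by default. Only the feature column is scaled; the bias column of ones is added afterwards.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100

# Synthetic data (an assumption for illustration): one feature on a large scale.
x_raw = rng.uniform(0, 1000, size=(m, 1))
y = 85.0 + 0.12 * x_raw + rng.normal(scale=5.0, size=(m, 1))

# Standardize the feature only, then prepend the bias column of ones.
x_scaled = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)
X_b = np.c_[np.ones((m, 1)), x_scaled]

eta = 0.1  # learning rate
n_iterations = 1000
theta = rng.standard_normal((2, 1))  # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Least-squares reference solution on the same scaled inputs.
theta_ref, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print("gradient descent:", theta.ravel())
print("least squares:   ", theta_ref.ravel())
```

Note that the parameters found this way refer to the scaled feature, so they only match a reference solution computed on the same scaled inputs.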
As Kuedsha suggested, I tried applying a learning schedule to reduce the learning rate at each iteration, even though it is not necessary in this specific case. It works, but of course it takes more iterations to converge. I think this could be a useful thing to do in a stochastic gradient descent algorithm.
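To illustrate the trade-off, here is a rough sketch comparing a constant learning rate with the two decaying schedules suggested in the other answer. The data is synthetic, the helper `gradient_descent` and the `schedules` dictionary are hypothetical names introduced for this example, and theta starts at zero (rather than randomly) so the comparison is deterministic.

```python
import numpy as np

def gradient_descent(X_b, y, eta_schedule, n_iterations=1000):
    """Batch gradient descent on the MSE cost; eta_schedule maps iteration -> eta."""
    m = len(X_b)
    theta = np.zeros((X_b.shape[1], 1))
    for iteration in range(n_iterations):
        eta = eta_schedule(iteration)
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients
    return theta

# Synthetic, standardized data (an assumption for illustration).
rng = np.random.default_rng(0)
m = 100
x = rng.normal(size=(m, 1))
x = (x - x.mean()) / x.std()
y = 4.0 + 3.0 * x + 0.1 * rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), x]
theta_ref, *_ = np.linalg.lstsq(X_b, y, rcond=None)

initial_eta = 0.1
schedules = {
    "constant": lambda it: initial_eta,
    "1/sqrt(t+1)": lambda it: initial_eta / np.sqrt(it + 1),
    "1/(t+1)": lambda it: initial_eta / (it + 1),
}

errors = {}
for name, schedule in schedules.items():
    theta = gradient_descent(X_b, y, schedule)
    errors[name] = float(np.linalg.norm(theta - theta_ref))
    print(f"{name:>12}: distance to least-squares solution = {errors[name]:.2e}")
```

With the same iteration budget, the decaying schedules should end up further from the least-squares solution than the constant rate, and the faster-decaying 1/(t+1) schedule further than 1/sqrt(t+1).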
Thanks for your support
Upvotes: 2
Reputation: 21
In my personal experience, this is probably due to the learning rate you are using. If your result goes to infinity, the learning rate is likely too large. Also, try decreasing the learning rate (eta in your code) at each iteration, as this helps the solution converge. I am not sure what the optimal schedule would be for your particular problem, but you could try something like:
eta = initial_eta / (iteration + 1)
or
eta = initial_eta / np.sqrt(iteration + 1)
Edit: in fact, as you can see in your results, the value of each parameter flips sign at every iteration while growing in modulus.
I think this is because, when you calculate the gradient in the first iteration, eta*gradient is so large that the update overshoots the minimum and lands on a value of opposite sign and larger modulus. In the second iteration the gradient is therefore even larger, so eta*gradient is larger too, and the update overshoots again in the other direction, giving a number that is larger still in modulus. This continues until you reach infinity.
This is the reason why you normally have to be careful when tuning the value for the learning rate and decrease it with the iterations.
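The overshooting described above can be reproduced with a small sketch. For the MSE cost, batch gradient descent with a constant learning rate is only stable when eta is below 2 divided by the largest eigenvalue of the Hessian, 2/m * X_b.T @ X_b. The data below is synthetic and deliberately left unscaled (an assumption for illustration, not the data from the question).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
x = rng.uniform(0, 100, size=(m, 1))  # deliberately unscaled feature
y = 5.0 + 0.5 * x + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), x]

# Stability bound for a constant learning rate on this quadratic cost.
hessian = 2 / m * X_b.T @ X_b
eta_max = 2 / np.linalg.eigvalsh(hessian).max()
print(f"largest stable learning rate: {eta_max:.2e}")

eta = 0.1  # far above the bound for this unscaled data
theta = np.zeros((2, 1))
history = []
for iteration in range(5):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients
    history.append(float(theta[1, 0]))  # track the slope parameter

print(history)  # the sign flips each iteration while the modulus grows
```

Standardizing the feature shrinks the largest eigenvalue, which is why the same eta = 0.1 can be stable on scaled data.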
Upvotes: 0