KaranJ

Reputation: 58

Stochastic Gradient Descent implementation in Python from scratch. Is the implementation correct?

I know this will seem similar to a lot of questions asked previously on the same topic. I have surveyed most of them, but they don't quite answer my question. My problem is that my gradient descent is not converging to the optimum; rather, it diverges and oscillates even at very low values of alpha.

My data generation code is below

X = [[float(np.random.randn(1)) for i in range(0,100)] for j in range(0,5)]
X = np.array(X).transpose()
Y = [float(0) for i in range(0,100)]
Y = 2*X[:,0] + 3*X[:,1] + 1*X[:,2] + 4*X[:,3] + 1*X[:,4] + 5
fig, ax = plt.subplots(1,5)
fig.set_size_inches(20,5)
k = 0
for j in range(0,5):
    sns.scatterplot(X[:,k],Y,ax=ax[j])
    k += 1

My SGD implementation is below

def multilinreg(X,Y,epsilon = 0.000001,alpha = 0.01,K = 20):
    Xnot = [[1] for i in range(0,len(X))]
    Xnot = np.array(Xnot)
    X = np.append(Xnot,X, axis = 1)
    vars = X.shape[1]
    W = []
    W = [np.random.normal(1) for i in range(vars)]
    W = np.array(W)
    J = 0
    for i in range(len(X)):
      Yunit = 0
      for j in range(vars):
        Yunit = Yunit + X[i,j] * W[j]
        J = J + (0.5/(len(X)))*((Y[i]-Yunit)**2)
    err = 1
    iter = 0
    Weights = []
    Weights.append(W)
    Costs = []
    while err > epsilon:
      index = [np.random.randint(len(Y)) for i in range(K)]
      Xsample, Ysample = X[index,:], Y[index]
      m =len(Xsample)
      Ypredsample = []
      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + X[i,j] * W[j]
        Ypredsample.append(Yunit)
      Ypredsample = np.array(Ypredsample)
      for i in range(len(Xsample)):
        for j in range(vars):
          gradJunit = (-1)*(Xsample[i,j]*(Ysample[i] - Ypredsample[i]))
          W[j] = W[j] - alpha*gradJunit
      Jnew = 0
      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + Xsample[i,j]*W[j]
          Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)
      Weights.append(W)
      err = abs(float(Jnew - J))
      J = Jnew 
      Costs.append(J)
      iter += 1
      if iter % 1000 == 0:
        print(iter)
        print(J)
    Costs = np.array(Costs)
    Ypred = []
    for i in range(len(X)):
      Yunit = 0
      for j in range(vars):
        Yunit = Yunit + X[i,j] * W[j]
      Ypred.append(Yunit)
    Ypred = np.array(Ypred)
    return Ypred, iter, Costs, W

The hyperparameters are below

epsilon = 1*(10)**(-20)
alpha = 0.0000001
K = 50

I don't think it is a data issue; I am using a fairly straightforward linear function.

I thought it might be the equations, but I have double-checked them as well and they seem fine to me.

Upvotes: -1

Views: 735

Answers (1)

Tristan Nemoz

Reputation: 2048

Several things need to be corrected in your implementation (most of them for efficiency reasons). Of course, you would save time by simply defining w = np.array([5, 2, 3, 1, 4, 1]), but that does not answer the question of why your SGD implementation does not work.

First of all, you define X by doing:

X = [[float(np.random.randn(1)) for i in range(0,100)] for j in range(0,5)]
X = np.array(X).transpose()

A faster way of doing this operation is simply:

X = np.random.randn(100, 5)

Then, you define Y:

Y = [float(0) for i in range(0,100)]
Y = 2*X[:,0] + 3*X[:,1] + 1*X[:,2] + 4*X[:,3] + 1*X[:,4] + 5

The first initialisation Y = [float(0) for i in range(0,100)] is useless, since you immediately override Y with the second line. A more condensed way of writing that second line would have been:

Y = X @ np.array([2, 3, 1, 4, 1]) + 5

Now, concerning your SGD implementation. The lines:

    Xnot = [[1] for i in range(0,len(X))]
    Xnot = np.array(Xnot)
    X = np.append(Xnot,X, axis = 1)

can be rewritten more efficiently as:

    X = np.hstack((np.ones(len(X)).reshape(-1, 1), X))

Similarly, the lines

    W = []
    W = [np.random.normal(1) for i in range(vars)]
    W = np.array(W)

can be rewritten using numpy functions. Note that the first line W = [] is useless, since you override W immediately afterwards without using it. np.random.normal can directly generate more than one sample via the size keyword argument. Also note that np.random.normal(1) samples from a normal distribution with mean 1 and std 1, while you probably want to sample from a normal distribution with mean 0 and std 1. Hence, you should define:

    W = np.random.normal(size=vars)
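
For instance, to see the difference (these two lines are just an illustration):

    np.random.normal(1)        # one sample from a normal with mean 1 and std 1
    np.random.normal(size=6)   # six samples from the standard normal (mean 0, std 1)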

Yunit is the prediction you make using W. By definition, you can compute it by doing the following:

    Yunit = X @ W

which avoids the nested for loops. The way you compute J is strange though. If I'm not mistaken, J corresponds to your loss function. However, the formula for J, assuming an MSE loss, is J = 0.5 * sum_{k=1}^{len(X)} (y_k - w·x_k)². Hence, these two nested for loops can be rewritten as:

    Yunit = X @ W
    J = 0.5 * np.sum((Y - Yunit) ** 2)

As a side remark: naming err that way may be misleading, since "error" usually refers to the cost, while here it denotes the progress made at each step. The lines:

    Weights = []
    Weights.append(W)

can be rewritten as:

    Weights = [W]

It would also be logical to add J to your Costs list, since it is the cost that corresponds to W:

    Costs = [J]

Since you want to perform stochastic gradient descent, there is no need to pick the samples you take from your dataset at random. You have two choices: either you update your weights one sample at a time, or you compute the gradient of J over your whole dataset. The latter is a bit simpler to implement and generally converges more gracefully than the former; however, since you chose the former, that is the one I'll work with. Note that even in this version you don't have to pick your samples at random, but I'll use the same method as you, since it should also work. Concerning your sampling, though, I think it is better to ensure that you don't take the same index twice. Hence, you may want to define index like this:

    index = np.random.choice(np.arange(len(Y)), size=K, replace=False)

m is useless, since it will always be equal to K in this case. If you prefer to sample without ruling out duplicate indices, simply pass replace=True to the choice function.
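
For completeness, the other option mentioned above (computing the gradient of J over the whole dataset at every step) could look roughly like the sketch below. This is only an illustration, with an arbitrary function name and a fixed iteration count instead of the epsilon-based stopping criterion:

def full_batch_gd(X, Y, alpha=0.001, n_iters=1000):
    # Sketch of the full-batch variant: each step uses the gradient of
    # J = 0.5 * np.sum((Y - X @ W) ** 2) over the whole dataset.
    W = np.random.normal(size=X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ W - Y)  # dJ/dW
        W -= alpha * grad
    return W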

Once again, you can use matrix multiplication to compute Yunit more efficiently. Hence, you can replace:

      Ypredsample = []
      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + X[i,j] * W[j]
        Ypredsample.append(Yunit)

by:

    Ypredsample = Xsample @ W

Similarly, you can compute your weights update using numpy functions. Hence, you can replace:

      for i in range(len(Xsample)):
        for j in range(vars):
          gradJunit = (-1)*(Xsample[i,j]*(Ysample[i] - Ypredsample[i]))
          W[j] = W[j] - alpha*gradJunit

by:

    W -= alpha * np.sum((Ypredsample - Ysample).reshape(-1, 1) * Xsample, axis=0)
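
As a side note, the sum over the sampled rows is exactly a matrix-vector product, so the same update can equivalently be written as:

    W -= alpha * Xsample.T @ (Ypredsample - Ysample)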

Like before, computing your cost can be done using matrix multiplication. Note, however, that you should compute J over your whole dataset, so that the stopping criterion is measured consistently from one iteration to the next. Hence, you should replace:

      Jnew = 0
      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + Xsample[i,j]*W[j]
          Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)

by:

    Jnew = 0.5 * np.sum((Y - X @ W) ** 2)

Finally, you can use matrix multiplication to make your predictions. Hence, your final code should look like this:

import numpy as np

X = np.random.randn(100, 5)
Y = X @ np.array([2, 3, 1, 4, 1]) + 5

def multilinreg(X, Y, epsilon=0.00001, alpha=0.01, K=20):
    X = np.hstack((np.ones(len(X)).reshape(-1, 1), X))
    vars = X.shape[1]
    W = np.random.normal(size=vars)
    Yunit = X @ W
    J = 0.5 * np.sum((Y - Yunit) ** 2)
    err = 1
    Weights = [W]
    Costs = [J]
    iter = 0

    while err > epsilon:
        index = np.random.choice(np.arange(len(Y)), size=K, replace=False)
        Xsample, Ysample = X[index], Y[index]
        Ypredsample = Xsample @ W
        W -= alpha * np.sum((Ypredsample - Ysample).reshape(-1,1) * Xsample, axis=0)
        Jnew = 0.5 * np.sum((Y - X @ W) ** 2)
        Weights.append(W.copy())  # append a copy, since W is updated in place
        err = abs(Jnew - J)
        J = Jnew
        Costs.append(J)
        iter += 1

        if iter % 10 == 0:
            print(iter)
            print(J)

    Costs = np.array(Costs)
    Ypred = X @ W
    return Ypred, iter, Costs, W
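
For instance, it can be called on the data generated above like this (the names on the left-hand side are only illustrative):

Ypred, n_iters, Costs, W = multilinreg(X, Y)
print(W)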

Running it returns W=array([4.99956786, 2.00023614, 3.00000213, 1.00034205, 3.99963732, 1.00063196]) in 61 iterations with a final cost of 3.05e-05.

Now that we know that this code is correct, we can use it to determine where yours went wrong. In this piece of code:

      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + X[i,j] * W[j]
        Ypredsample.append(Yunit)
      Ypredsample = np.array(Ypredsample)

you use X[i, j] instead of Xsample[i, j], which makes no sense. Plus, if you print W along with J and iter in your loop, you can see that the program finds the correct W quite quickly (once the previous fix has been made), but does not stop, probably because J is not correctly computed. The error is that this line:

Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)

isn't correctly indented. Indeed, it is not supposed to be part of the for j in range(vars) loop, but is supposed to be part of the for i in range(len(Xsample)) loop only, like this:

      Jnew = 0
      for i in range(len(Xsample)):
        Yunit = 0
        for j in range(vars):
          Yunit = Yunit + Xsample[i,j]*W[j]
        Jnew = Jnew + (0.5/(len(Xsample)))*((Ysample[i]-Yunit)**2)

After correcting this, your code works. The same indentation error is also present at the beginning of your program, when the initial J is computed, but it does not affect the result as long as more than two iterations are performed.

Upvotes: 1
