Reputation: 11
I've run into a problem while trying to implement Stochastic Gradient Descent: my cost keeps growing like crazy and I have no clue why.
MSE implementation:
import numpy as np

def mse(x, y, w, b):
    # mean squared error with a factor 1/2: (1 / 2n) * sum((y - Xw - b)^2)
    predictions = x @ w
    summed = (np.square(y - predictions - b)).mean(0)
    cost = summed / 2
    return cost
Gradients:
def grad_w(y, x, w, b, n_samples):
    # gradient of the cost w.r.t. w: -(1/n) * X^T (y - Xw - b)
    return -y @ x / n_samples + x.T @ x @ w / n_samples + b * x.mean(0)

def grad_b(y, x, w, b, n_samples):
    # gradient of the cost w.r.t. b: -mean(y - Xw - b)
    return -y.mean(0) + x.mean(0) @ w + b
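As a sanity check on these formulas, they can be compared against finite differences of the cost; here is a quick sketch (the arrays x_check and y_check are made-up toy data, used only for the check):

rng = np.random.default_rng(0)
x_check = rng.normal(size=(50, 3))      # made-up toy inputs, just for the check
y_check = rng.normal(size=50)           # made-up toy targets
w_check = np.array([0.1, -0.3, 0.2])
b_check = 0.5

def numerical_grad_w(x, y, w, b, eps=1e-6):
    # central finite differences of mse with respect to each weight
    g = np.zeros_like(w)
    for j in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        g[j] = (mse(x, y, w_plus, b) - mse(x, y, w_minus, b)) / (2 * eps)
    return g

print(grad_w(y_check, x_check, w_check, b_check, len(y_check)))  # analytic gradient
print(numerical_grad_w(x_check, y_check, w_check, b_check))      # should agree closely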
SGD Implementation:
def stochastic_gradient_descent(X, y, w, b, learning_rate=0.01, iterations=500, batch_size=100):
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length / batch_size)
    for it in range(iterations):
        cost = 0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0, length, batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]
            w -= learning_rate * grad_w(y_i, X_i, w, b, length)
            b -= learning_rate * grad_b(y_i, X_i, w, b, length)
            cost = mse(X_i, y_i, w, b)
        cost_history[it] = cost
        if cost_history[it] <= 0.0052: break
    return w, cost_history[:it]
Random Variables:
w_true = np.array([0.2, 0.5,-0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1)
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
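For reference, since the noise added to y has standard deviation 0.1, the cost at the true parameters is roughly 0.1² / 2 = 0.005, which is presumably where the 0.0052 stopping threshold above comes from:

print(mse(x, y, w_true, b_true))  # ~0.005: half the noise variance, roughly the best achievable cost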
After running this:
theta,cost_history = stochastic_gradient_descent(x,y,w,b)
print('Final cost/MSE: {:0.3f}'.format(cost_history[-1]))
I get this:
Final cost/MSE: 3005958172614261248.000
And here is the plot of the cost history.
Upvotes: 1
Views: 75
Reputation: 11
Hey @TQCH, and thanks for that. I've come up with a different approach to implementing SGD, without the inner loop, and the results were also pretty sweet.
def stochastic_gradient_descent(X, y, w, b, learning_rate=0.35, iterations=3000, batch_size=100):
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length / batch_size)
    marker = 0
    cost = mse(X, y, w, b)
    print(cost)
    for it in range(iterations):
        cost = 0
        indices = np.random.choice(length, batch_size)  # sample one mini-batch per iteration
        X_i = X[indices]
        y_i = y[indices]
        w -= learning_rate * grad_w(y_i, X_i, w, b)  # assumes grad_w/grad_b were redefined to use the batch size (see sketch below)
        b -= learning_rate * grad_b(y_i, X_i, w, b)
        cost = mse(X_i, y_i, w, b)
        cost_history[it] = cost
        if cost_history[it] <= 0.0075 and cost_history[it] > 0.0071: marker = it
        if cost <= 0.0052: break
    print(f'{w}, {b}')
    return w, cost_history, marker, cost
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 1.0
theta,cost_history, marker, cost = stochastic_gradient_descent(x,y,w,b)
print(f'Number of iterations: {marker}')
print('Final cost/MSE: {:0.3f}'.format(cost))
which gave me these results:
1.9443112664859845,
[ 0.19592532 0.31735225 -0.20044424], -0.9059800816290591
Number of iterations: 68
Final cost/MSE: 0.005
But you're right, I missed that I was dividing by the total length of the vector y instead of by the batch size, and I forgot to accumulate the batch loss!
Thanks for that!
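For completeness, the grad_w/grad_b calls above take only four arguments, so the gradient functions were presumably redefined to take the sample count from the batch itself; a sketch of what that redefinition could look like:

# Hypothetical redefinition (assumed, not shown in the post above): n comes from the batch
def grad_w(y, x, w, b):
    n = len(y)
    return -y @ x / n + x.T @ x @ w / n + b * x.mean(0)

def grad_b(y, x, w, b):
    return -y.mean(0) + x.mean(0) @ w + b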
Upvotes: 0
Reputation: 1232
Here are a few suggestions:
def stochastic_gradient_descent(X, y, w, b, learning_rate=0.01, iterations=500, batch_size=100):
    length = len(y)
    cost_history = np.zeros(iterations)
    n_batches = int(length / batch_size)
    for it in range(iterations):
        cost = 0
        indices = np.random.permutation(length)
        X = X[indices]
        y = y[indices]
        for i in range(0, length, batch_size):
            X_i = X[i:i+batch_size]
            y_i = y[i:i+batch_size]
            w -= learning_rate * grad_w(y_i, X_i, w, b, len(X_i))  # the denominator should be the actual batch size
            b -= learning_rate * grad_b(y_i, X_i, w, b, len(X_i))
            cost += mse(X_i, y_i, w, b) * len(X_i)  # add batch loss
        cost_history[it] = cost / length  # this is a running average of your batch losses, which is statistically more stable
        if cost_history[it] <= 0.0052: break
    return w, b, cost_history[:it]
The final results:
w_true = np.array([0.2, 0.5, -0.2])
b_true = -1
first_feature = np.random.normal(0,1,1000)
second_feature = np.random.uniform(size=1000)
third_feature = np.random.normal(1,2,1000)
arrays = [first_feature,second_feature,third_feature]
x = np.stack(arrays,axis=1)
y = x @ w_true + b_true + np.random.normal(0,0.1,1000)
w = np.asarray([0.0,0.0,0.0], dtype='float64')
b = 0.0
theta,bias,cost_history = stochastic_gradient_descent(x,y,w,b,learning_rate=1e-3,iterations=3000)
print("Final epoch cost/MSE: {:0.3f}".format(cost_history[-1]))
print("True final cost/MSE: {:0.3f}".format(mse(x,y,theta,bias)))
print(f"Final coefficients:\n{theta,bias}")
Upvotes: 0