Reputation: 136
I am working on multivariate linear regression and using stochastic gradient descent (SGD) to optimize it.
I am working on this dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
For every run, all hyperparameters and everything else stay the same: epochs=200 and alpha=0.1.
On the first run I got final_cost=0.0591. Running the program again with everything unchanged gave final_cost=1.0056, then final_cost=0.8214, then final_cost=15.9591, then final_cost=2.3162, and so on.
As you can see, with everything kept the same, the final cost changes by a large amount from run to run, sometimes jumping from 0.8 straight to 15.9 or from 0.05 to 1.00. On top of that, the graph of the cost after every epoch within a single run is very zigzag, unlike batch GD, where the cost graph decreases smoothly.
I can't understand why SGD behaves so strangely, giving different results on different runs.
I tried the same with batch GD and everything is absolutely fine and smooth, as expected. With batch GD, no matter how many times I run the same code, the result is exactly the same every time.
But in the case of SGD, I literally cried:
import time

import matplotlib.pyplot as plt
import numpy as np


class Abalone:
    def __init__(self, df, epochs=200, miniBatchSize=250, alpha=0.1):
        self.df = df.dropna()
        self.epochs = epochs
        self.miniBatchSize = miniBatchSize
        self.alpha = alpha
        print("abalone created")
        self.modelTheData()

    def modelTheData(self):
        # the last column is the target, all other columns are attributes
        self.TOTAL_ATTR = len(self.df.columns) - 1
        self.TOTAL_DATA_LENGTH = len(self.df.index)
        # first 60% of the rows for training, remaining 40% for testing
        split = int(self.TOTAL_DATA_LENGTH * 0.6)
        self.df_trainingData = self.df.drop(self.df.index[split:])
        self.TRAINING_DATA_SIZE = len(self.df_trainingData)
        self.df_testingData = self.df.drop(self.df.index[:split])
        self.TESTING_DATA_SIZE = len(self.df_testingData)
        # note: this overrides the miniBatchSize passed to the constructor
        self.miniBatchSize = int(self.TRAINING_DATA_SIZE / 10)
        self.thetaVect = np.zeros((self.TOTAL_ATTR + 1, 1), dtype=float)
        self.stochasticGradientDescent()

    def stochasticGradientDescent(self):
        self.finalCostArr = np.array([])
        startTime = time.time()
        for i in range(self.epochs):
            # reshuffle the training rows at the start of every epoch
            self.df_trainingData = self.df_trainingData.sample(frac=1).reset_index(drop=True)
            # consecutive slices of miniBatchSize rows (the last one may be shorter)
            miniBatches = [self.df_trainingData.iloc[x:x + self.miniBatchSize, :]
                           for x in range(0, self.TRAINING_DATA_SIZE, self.miniBatchSize)]
            self.epochCostArr = np.array([])
            for j in miniBatches:
                tempMat = j.values
                self.actualValVect = tempMat[:, self.TOTAL_ATTR:]
                tempMat = tempMat[:, :self.TOTAL_ATTR]
                # design matrix: a column of ones (bias) followed by the attributes
                self.desMat = np.append(np.ones((len(j.index), 1), dtype=float), tempMat, 1)
                del tempMat
                self.trainData()
                currCost = self.costEvaluation()
                self.epochCostArr = np.append(self.epochCostArr, currCost)
            # record the cost of the last mini-batch of this epoch
            self.finalCostArr = np.append(self.finalCostArr,
                                          self.epochCostArr[len(miniBatches) - 1])
        endTime = time.time()
        print(f"execution time : {endTime - startTime}")
        self.graphEvaluation()
        print(f"final cost : {self.finalCostArr[len(self.finalCostArr) - 1]}")
        print(self.thetaVect)

    def trainData(self):
        self.predictedValVect = self.predictResult()
        diffVect = self.predictedValVect - self.actualValVect
        partialDerivativeVect = np.matmul(self.desMat.T, diffVect)
        self.thetaVect -= (self.alpha / len(self.desMat)) * partialDerivativeVect

    def predictResult(self):
        return np.matmul(self.desMat, self.thetaVect)

    def costEvaluation(self):
        # uses the predictions computed before the latest update of thetaVect
        cost = sum((self.predictedValVect - self.actualValVect) ** 2)
        return cost / (2 * len(self.actualValVect))

    def graphEvaluation(self):
        plt.title("cost at end of all epochs")
        # epochCostArr holds the mini-batch costs of the most recent epoch
        x = range(len(self.epochCostArr))
        y = self.epochCostArr
        plt.plot(x, y)
        plt.xlabel("iterations")
        plt.ylabel("cost")
        plt.show()
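One detail of the code above may be worth checking: `finalCostArr` stores the cost of the last mini-batch of each epoch, not the cost over the whole training set. Assuming the training set has 2506 rows (60% of the 4177 rows in the UCI file, which is an assumption about how the data was loaded) and a mini-batch size of 250, that last mini-batch would contain only 6 rows, so its cost could swing a lot from one shuffle to the next even when theta barely changes. The sketch below uses synthetic data and hypothetical names (`half_mse`, `tail_costs`) to compare the two measurements for one fixed parameter vector; it is an illustration, not a diagnosis of the actual runs.

```python
import numpy as np

def half_mse(X, y, theta):
    """Cost used in the question: sum of squared errors / (2 * m)."""
    residual = X @ theta - y
    return float(np.sum(residual ** 2)) / (2 * len(y))

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable
m, n_features = 2506, 8                   # assumed training-set shape
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n_features))])
true_theta = rng.normal(size=(n_features + 1, 1))
y = X @ true_theta + rng.normal(scale=2.0, size=(m, 1))   # synthetic noisy targets

theta = true_theta                        # one fixed parameter vector for every "run"
full_cost = half_mse(X, y, theta)

batch_size = m // 10                      # 250, mirroring modelTheData
tail_costs = []
for _ in range(5):                        # five reshuffles, standing in for five runs
    order = rng.permutation(m)
    last_batch = order[(m // batch_size) * batch_size:]   # the leftover rows
    tail_costs.append(half_mse(X[last_batch], y[last_batch], theta))

print("rows in last mini-batch   :", len(last_batch))
print("cost on full training set :", full_cost)
print("cost on last mini-batch   :", tail_costs)
```

If the numbers behave similarly on the real data, reporting the cost over the full training set (or the mean over all mini-batches of the epoch) would give a steadier figure to compare between runs.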
I kept epochs=200 and alpha=0.1 for all runs, but I got a totally different result on each run.
The vectors below are the theta vectors; in each one the first entry is the bias and the remaining entries are the weights.
RUN 1 =>>
[[ 5.26020144]
[ -0.48787333]
[ 4.36479114]
[ 4.56848299]
[ 2.90299436]
[ 3.85349625]
[-10.61906207]
[ -0.93178027]
[ 8.79943389]]
final cost : 0.05917831328836957
RUN 2 =>>
[[ 5.18355814]
[ -0.56072668]
[ 4.32621647]
[ 4.58803884]
[ 2.89157598]
[ 3.7465471 ]
[-10.75751065]
[ -1.03302031]
[ 8.87559247]]
final cost: 1.0056239103948563
RUN 3 =>>
[[ 5.12836056]
[ -0.43672936]
[ 4.25664898]
[ 4.53397465]
[ 2.87847224]
[ 3.74693215]
[-10.73960775]
[ -1.00461585]
[ 8.85225402]]
final cost : 0.8214901206702101
RUN 4 =>>
[[ 5.38794798]
[ 0.23695412]
[ 4.43522951]
[ 4.66093372]
[ 2.9460605 ]
[ 4.13390252]
[-10.60071883]
[ -0.9230675 ]
[ 8.87229324]]
final cost: 15.959132174895712
RUN 5 =>>
[[ 5.19643132]
[ -0.76882106]
[ 4.35445135]
[ 4.58782119]
[ 2.8908931 ]
[ 3.63693031]
[-10.83291949]
[ -1.05709616]
[ 8.865904 ]]
final cost: 2.3162151072779804
I am unable to figure out what is going wrong. Does SGD normally behave like this, or did I make a mistake while converting my code from batch GD to SGD? And if SGD does behave like this, how do I know how many times I have to rerun it? I am not lucky enough to get a cost as small as 0.05 on the first run every time: sometimes the first run gives a cost around 10.5, sometimes 0.6, and maybe after rerunning it many times I would get a cost even smaller than 0.05.
When I approach the exact same problem with the exact same code and hyperparameters, only replacing the SGD function with normal batch GD, I get the expected result: after each iteration over the same data the cost decreases smoothly (it is monotonically decreasing), and no matter how many times I rerun the program I get exactly the same result, which is to be expected.
Keeping everything the same but using batch GD with epochs=20000 and alpha=0.1, I got final_cost=2.7474.
# method of the Abalone class, used in place of stochasticGradientDescent
def BatchGradientDescent(self):
    self.costArr = np.array([])
    startTime = time.time()
    for i in range(self.epochs):
        tempMat = self.df_trainingData.values
        self.actualValVect = tempMat[:, self.TOTAL_ATTR:]
        tempMat = tempMat[:, :self.TOTAL_ATTR]
        self.desMat = np.append(np.ones((self.TRAINING_DATA_SIZE, 1), dtype=float), tempMat, 1)
        del tempMat
        self.trainData()
        if i % 100 == 0:
            currCost = self.costEvaluation()
            self.costArr = np.append(self.costArr, currCost)
    endTime = time.time()
    print(f"execution time : {endTime - startTime} seconds")
    self.graphEvaluation()
    print(self.thetaVect)
    print(f"final cost : {self.costArr[len(self.costArr) - 1]}")
Could somebody help me figure out what is actually going on? Every opinion or solution is very valuable to me in this new field :)
Upvotes: 1
Views: 422
Reputation: 1694
You missed the most important difference between GD ("gradient descent") and SGD ("stochastic gradient descent").
Stochasticity literally means "the quality of lacking any predictable order or plan", that is, randomness.
This means that in the GD algorithm every update uses the whole training set, so each epoch sees exactly the same data, whereas in SGD the samples are randomly shuffled at the beginning of every epoch and each update uses only one mini-batch. So every run of GD with the same initialization and hyperparameters will produce exactly the same results, while SGD will most definitely not (as you have experienced).
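If reproducible runs are needed while debugging, the shuffling can be seeded. The sketch below is an illustration on synthetic data, not the asker's code: `sgd_fit`, its parameters and the data are hypothetical. It runs mini-batch SGD for linear regression twice with the same seed and once with a different seed, then compares the resulting parameter vectors. In the question's code, a comparable change would be to use the `random_state` parameter of `DataFrame.sample`.

```python
import numpy as np

def sgd_fit(X, y, seed, epochs=20, batch_size=32, alpha=0.05):
    """Mini-batch SGD for linear regression; `seed` controls the shuffling."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((X.shape[1], 1))
    for _ in range(epochs):
        order = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta

# Synthetic regression problem (bias column plus three features)
data_rng = np.random.default_rng(123)
X = np.hstack([np.ones((500, 1)), data_rng.normal(size=(500, 3))])
y = X @ np.array([[1.0], [2.0], [-1.0], [0.5]]) + data_rng.normal(scale=0.5, size=(500, 1))

same_a = sgd_fit(X, y, seed=0)
same_b = sgd_fit(X, y, seed=0)
other = sgd_fit(X, y, seed=1)

print("same seed, identical theta     :", np.array_equal(same_a, same_b))
print("different seed, identical theta:", np.array_equal(same_a, other))
```

Seeding does not make SGD less noisy; it only makes the noise repeatable, which helps when comparing code changes.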
A common reason given for using stochasticity is that it makes it harder for the model to memorize the training samples (which would result in overfitting, where accuracy on the training set is high but accuracy on unseen samples is poor).
Now, regarding the big differences in final cost between runs in your case, my guess is that your learning rate is too high. You can use a lower constant value or, better yet, a decaying learning rate (one that gets lower as the epoch count gets higher).
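A minimal sketch of a decaying learning rate, assuming an inverse-time schedule alpha = alpha0 / (1 + decay * epoch). The schedule, the constants and the names (`inverse_time_decay`, `sgd`) are illustrative choices, not something taken from the question. It runs mini-batch SGD on synthetic data with a constant and with a decaying rate, and prints how much the full-training-set cost still fluctuates over the last 50 epochs in each case.

```python
import numpy as np

def inverse_time_decay(alpha0, decay, epoch):
    """Learning rate that shrinks as the epoch counter grows."""
    return alpha0 / (1.0 + decay * epoch)

def sgd(X, y, alpha0, decay, epochs=200, batch_size=25, seed=0):
    """Mini-batch SGD for linear regression; returns theta and per-epoch costs."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((X.shape[1], 1))
    costs = []
    for epoch in range(epochs):
        alpha = inverse_time_decay(alpha0, decay, epoch)
        order = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
        # cost over the full training set at the end of the epoch
        costs.append(float(np.sum((X @ theta - y) ** 2)) / (2 * len(y)))
    return theta, np.array(costs)

# Synthetic regression problem (bias column plus four features)
rng = np.random.default_rng(42)
X = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, 4))])
y = X @ rng.normal(size=(5, 1)) + rng.normal(scale=1.0, size=(1000, 1))

_, cost_const = sgd(X, y, alpha0=0.1, decay=0.0)     # decay=0 keeps alpha constant
_, cost_decay = sgd(X, y, alpha0=0.1, decay=0.05)

print("cost spread, constant alpha:", cost_const[-50:].std())
print("cost spread, decaying alpha:", cost_decay[-50:].std())
```

The decay constant trades off early progress against late stability, so it usually needs some tuning for the dataset at hand.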
Upvotes: 1