Reputation: 229
This is my first time training a GRU RNN on a dataset of 8 variables in a time series spanning about 20 years. The biomass value is what I'm trying to predict from the other variables. I'm starting with a single-layer GRU. I'm not using softmax for the output layer, and MSE is my cost function.
It is a basic GRU with forward propagation and backward gradient updates. Here are the main functions I defined:
'x_t is the input training dataset with a dimension of 7572x8. So T = 7572, input_dim = 8, hidden_dim =128. y_train is my train label.'
def forward_prop_step(self, x_t,y_train, s_t1_prev,V, U, W, b, c,learning_rate):
    T = x_t.shape[0]
    z_t1 = np.zeros((T,self.hidden_dim))
    r_t1 = np.zeros((T,self.hidden_dim))
    h_t1 = np.zeros((T,self.hidden_dim))
    s_t1 = np.zeros((T+1,self.hidden_dim))
    o_s = np.zeros((T,self.input_dim))
    for i in xrange(T):
        x_e = x_t[i].T
        z_t1[i] = sigmoid(U[0].dot(x_e) + W[0].dot(s_t1[i]) + b[0])#128x1
        r_t1[i] = sigmoid(U[1].dot(x_e) + W[1].dot(s_t1[i]) + b[1])#128x1
        h_t1[i] = np.tanh(U[2].dot(x_e) + W[2].dot(s_t1[i] * r_t1[i]) + b[2])#128x1
        s_t1[i+1] = (np.ones_like(z_t1[i]) - z_t1[i]) * h_t1[i] + z_t1[i] * s_t1[i]#128x1
        o_s[i] = np.dot(V,s_t1[i+1]) + c#8x1
    return [o_s,z_t1,r_t1,h_t1,s_t1]
def bptt(self, x,y_train,o,z_t1,r_t1,h_t1,s_t1,V, U, W, b, c):
    bptt_truncate = 360
    T = x.shape[0]#length of time scale of input data (train)
    dLdU = np.zeros(U.shape)
    dLdV = np.zeros(V.shape)
    dLdW = np.zeros(W.shape)
    dLdb = np.zeros(b.shape)
    dLdc = np.zeros(c.shape)
    y_train_sp = np.repeat(y_train,self.input_dim)
    for t in np.arange(T)[::-1]:
        dLdy = 2 * (o[t] - y_train_sp[t])
        dydV = s_t1[t]
        dydc = 1.0
        dLdV += np.outer(dLdy,dydV)
        dLdc += dLdy*dydc
        for i in np.arange(max(0, t-bptt_truncate), t+1)[::-30]:#every month in the past year
            s_t1_pre = s_t1[i]
            dydst1 = V #8x128
            dst1dzt1 = -h_t1[i] + s_t1_pre #128x1
            dst1dht1 = np.ones_like(z_t1[i]) - z_t1[i] #128x1
            dzt1dU = np.outer(z_t1[i]*(1.0-z_t1[i]),x[i]) #128x8
            #print dzt1dU.shape
            dzt1dW = np.outer(z_t1[i]*(1.0-z_t1[i]),s_t1_pre) #128x128
            dzt1db = z_t1[i]*(1.0-z_t1[i]) #128x1
            dht1dU = np.outer((1.0-h_t1[i] ** 2),x[i]) #128x8
            dht1dW = np.outer((1.0-h_t1[i] ** 2),s_t1_pre * r_t1[i]) #128x128
            dht1db = 1.0-h_t1[i] ** 2 #128x1
            dht1drt1 = (1.0-h_t1[i] ** 2)*(W[2].dot(s_t1_pre))#128x1
            drt1dU = np.outer((r_t1[i]*(1.0-r_t1[i])),x[i]) #128x8
            drt1dW = np.outer((r_t1[i]*(1.0-r_t1[i])),s_t1_pre) #128x128
            drt1db = (r_t1[i]*(1.0-r_t1[i]))#128x1
            dLdW[0] += np.outer(dydst1.T.dot(dLdy),dzt1dW.dot(dst1dzt1)) #128x128
            dLdU[0] += np.outer(dydst1.T.dot(dLdy),dst1dzt1.dot(dzt1dU)) #128x8
            dLdb[0] += (dydst1.T.dot(dLdy))*dst1dzt1*dzt1db#128x1
            dLdW[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dW)#128x128
            dLdU[1] += np.outer(dydst1.T.dot(dLdy),dst1dht1*dht1drt1).dot(drt1dU) #128x8
            dLdb[1] += (dydst1.T.dot(dLdy))*dst1dht1*dht1drt1*drt1db#128x1
            dLdW[2] += np.outer(dydst1.T.dot(dLdy),dht1dW.dot(dst1dht1)) #128x128
            dLdU[2] += np.outer(dydst1.T.dot(dLdy),dst1dht1.dot(dht1dU))#128x8
            dLdb[2] += (dydst1.T.dot(dLdy))*dst1dht1*dht1db#128x1
    return [ dLdV,dLdU, dLdW, dLdb, dLdc ]
def predict( self, x):
    pred = np.amax(x, axis = 1)
    pred_f = relu(pred)
    return pred_f
The parameters V, U, W, b, c are updated using the gradients dLdV, dLdU, dLdW, dLdb, dLdc calculated by bptt.
I have tried different weight initializations (Xavier or plain random) and different truncation lengths, but all lead to the same outcome. Probably the weight update wasn't right? The network setup seems simple, though. I'm also really struggling to understand the predictions and how to translate them into actual biomass. The predict function is what I defined to translate the GRU's output layer into a biomass value by taking the maximum value, but the output layer gives similar values at almost every time step. I'm not sure what the best way to do this is. Thanks in advance for any help or suggestions.
Upvotes: 0
Views: 72
Reputation: 2483
I doubt anyone on Stack Overflow is going to debug a custom implementation of GRU for you. If you were using TensorFlow or another high-level library, I might take a stab at it, or if it were a simple fully connected network, but as it is, all I can do is give some advice on how to proceed with debugging.
First, it sounds like you're running a brand new implementation on your own data set right off the bat. Instead, start by testing your network on trivial, synthetic data sets. Can it learn an identity function? A response which is simply the weighted average of the three previous time steps? And so on. It's easier to debug small, simple problems. Once you know your implementation can learn things that a GRU-based recurrent network should be able to learn, then you can start using your own data.
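As a rough sketch of what such synthetic tasks could look like, the snippet below builds the two examples just mentioned. The function names, the sequence length of 200, the uniform input range, and the weights (0.5, 0.3, 0.2) are all arbitrary choices for illustration, not anything taken from the question's code; only the input width of 8 mirrors the question.

```python
import numpy as np

def make_identity_task(T=200, input_dim=8, seed=0):
    """Target equals the first input feature at the same time step."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-1.0, 1.0, size=(T, input_dim))
    y = x[:, 0].copy()
    return x, y

def make_weighted_average_task(T=200, input_dim=8,
                               weights=(0.5, 0.3, 0.2), seed=0):
    """Target is a weighted average of the first feature at the
    three previous time steps (weights[0] applies to step t-1)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-1.0, 1.0, size=(T, input_dim))
    y = np.zeros(T)
    k = len(weights)
    # The first k steps have no full history, so their target stays 0.
    for t in range(k, T):
        y[t] = sum(w * x[t - 1 - j, 0] for j, w in enumerate(weights))
    return x, y

x_id, y_id = make_identity_task()
x_avg, y_avg = make_weighted_average_task()
```

If the network cannot drive the training error close to zero on the identity task, the problem is almost certainly in the implementation rather than in the biomass data. The weighted-average task additionally requires the hidden state to carry information across time steps, so it exercises the recurrent part.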
Second, this comment of yours was very insightful:
Probably the weight update wasn't right?
While it's impossible to say for sure, this is a very common - perhaps the most common - source of bugs in backprop implementations. Andrew Ng recommends gradient checking to debug an implementation like this. Essentially, this involves numerically approximating the gradient. It's computationally inefficient but relies only on a correct implementation of forward propagation, which makes it very useful for debugging. For one, if the algorithm converges when the numerically approximated gradient is used, you can be more confident that your forward prop is correct and focus on debugging backprop. (On the other hand, if it still does not succeed, the issue is likely in your forward prop function.) For another, once the algorithm is working with the numerically approximated gradient, you can compare the output of your analytic gradient function with it and debug any discrepancies. This makes it a lot easier because you now know the correct answer that it should return.
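A minimal sketch of gradient checking, assuming a central-difference approximation, is shown below. The helper names (numerical_gradient, relative_error) and the toy loss are made up for illustration; the toy loss is a single linear output layer with squared error, standing in for the full GRU loss, which would instead run your forward pass and return the MSE.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d(loss)/d(params).

    loss_fn takes the parameter array and returns a scalar loss.
    params must be a float array; it is perturbed in place and restored.
    """
    grad = np.zeros_like(params)
    for idx in np.ndindex(*params.shape):
        original = params[idx]
        params[idx] = original + eps
        loss_plus = loss_fn(params)
        params[idx] = original - eps
        loss_minus = loss_fn(params)
        params[idx] = original  # restore before moving on
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

def relative_error(a, b):
    """Norm of the difference, scaled by the norms of both gradients."""
    denom = max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)
    return np.linalg.norm(a - b) / denom

# Toy stand-in (an assumption, not the GRU): o = V.dot(s), loss = sum((o - y)^2)
rng = np.random.RandomState(0)
s = rng.randn(5)      # plays the role of a hidden state
V = rng.randn(3, 5)   # output weights
y = rng.randn(3)      # target

def loss(V_):
    return np.sum((V_.dot(s) - y) ** 2)

analytic = np.outer(2.0 * (V.dot(s) - y), s)  # hand-derived dL/dV
numeric = numerical_gradient(loss, V)
err = relative_error(analytic, numeric)
print("relative error:", err)
```

A relative error that is tiny (many orders of magnitude below 1) suggests the analytic gradient agrees with the numerical one; a large value points to a bug in the derivation or the indexing. For the GRU you would repeat the comparison for each of V, U, W, b and c, ideally on a very short sequence with a small hidden size, since every perturbed entry costs two full forward passes.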
Upvotes: 0