moreblue

Reputation: 332

Why is learning slow in an RNN implemented using a for-loop?

Problem settings

As a beginner with RNNs, I'm currently building a 3-to-1 autocompletion RNN model for 4-letter words, where the input is a 3-letter incomplete word and the output is the single letter that completes it. For example, given the input 'COD', I would like the model to predict 'E' (completing 'CODE').


Code - generate dataset

To get the desired result from an RNN model, I have made an (imbalanced) dataset as follows:

import string
import numpy as np       
import tensorflow as tf
import matplotlib.pyplot as plt

alphList  = list(string.ascii_uppercase)            # list of uppercase letters A-Z
alphToNum = {n: i for i, n in enumerate(alphList)}  # map each letter to its index

# Make dataset
# define words of interest
fourList = ['CARE', 'CODE', 'COME', 'CANE', 'COPE', 'FISH', 'JAZZ', 'GAME', 'WALK', 'QUIZ']

# (len(Sequence), len(Batch), len(Observation)) following tensorflow-style
first3Data = np.zeros((3, len(fourList), len(alphList)), dtype=np.int32)
last1Data  = np.zeros((len(fourList), len(alphList)), dtype=np.int32)

for idxObs, word in enumerate(fourList):
    # Make an array of one-hot vectors consisting of first 3 letters
    first3 = [alphToNum[n] for n in word[:-1]]
    first3Data[:,idxObs,:] = np.eye(len(alphList))[first3]
    # Make an array of one-hot vectors consisting of last 1 letter
    last1  = alphToNum[word[3]]
    last1Data[idxObs,:]    = np.eye(len(alphList))[last1]

So fourList contains the training words, first3Data contains the one-hot encodings of each word's first 3 letters (shape (3, 10, 26), i.e. (sequence, batch, observation)), and last1Data contains the one-hot encoding of each word's last letter (shape (10, 26)).
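
As a quick sanity check (not part of the original question), the shapes and the encoding can be verified by decoding one training example back into letters:

print(first3Data.shape)  # (3, 10, 26): (len(Sequence), len(Batch), len(Observation))
print(last1Data.shape)   # (10, 26)

idx = 1  # fourList[1] == 'CODE'
first3Letters = ''.join(alphList[np.argmax(v)] for v in first3Data[:, idx, :])
lastLetter    = alphList[np.argmax(last1Data[idx, :])]
print(first3Letters, '->', lastLetter)  # prints: COD -> E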


Code - build model

Following the standard setup of a 3-to-1 RNN model, I have written the following code.

# Hyperparameters
n_data        = len(fourList)
n_input       = len(alphList)  # number of input units
n_hidden      = 128            # number of hidden units
n_output      = len(alphList)  # number of output units
learning_rate = 0.01
total_epoch   = 100000

# Variables (separate version)
W_in  = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))

# Manual calculation of RNN output
def RNNoutput(Xinput):
    h_state = tf.random_normal([1, n_hidden])  # initial hidden state

    # Unroll the recurrence over the 3 input letters
    for iX in Xinput:
        h_state = tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))

    # Map the final hidden state to the output logits
    rnn_output = h_state @ W_out + b_out
    return rnn_output

Note that the "Manual calculation of RNN output" part basically unrolls the hidden state exactly 3 times (once per input letter), using matrix multiplication and the tanh activation function as follows:

tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))

Here, one epoch is completed every time the whole dataset is passed through, so I initialize h_state anew on every pass. Additionally, note that I have not used placeholders, which may be a cause of the learning instability.
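
For comparison, here is a minimal sketch (not part of the original question, and only one possible way to set it up) of the placeholder-based version using TensorFlow's built-in RNN functions, which is the kind of setup the footnote (*) below refers to; it reuses the hyperparameters defined above:

X = tf.placeholder(tf.float32, [3, None, n_input])  # (len(Sequence), len(Batch), len(Observation))
Y = tf.placeholder(tf.float32, [None, n_output])

cell = tf.nn.rnn_cell.BasicRNNCell(n_hidden)         # tanh activation by default
outputs, _ = tf.nn.static_rnn(cell, tf.unstack(X, axis=0), dtype=tf.float32)
logits = tf.layers.dense(outputs[-1], n_output)      # last hidden state -> letter logits

cost_builtin  = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=Y))
train_builtin = tf.train.AdamOptimizer(learning_rate).minimize(cost_builtin)

# Training would then feed the data explicitly, e.g.
# sess.run([train_builtin, cost_builtin], feed_dict={X: first3Data, Y: last1Data})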


Code - train

I have used the following code to train the network.

# Cost / optimizer definition
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=RNNoutput(first3Data),
                                                                 labels=last1Data))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# Train and keep track of the loss history
sess = tf.Session()
sess.run(tf.global_variables_initializer())

lossHistory = []
for epoch in range(total_epoch):
    _, loss = sess.run([optimizer, cost])
    lossHistory.append(loss)
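
To see what the trained network actually predicts, a small check like the following can be added after the loop (this is not in the original post; it rebuilds the forward pass with the same weights and takes the argmax over the 26 output logits):

predicted = sess.run(tf.argmax(RNNoutput(first3Data), axis=1))
for word, p in zip(fourList, predicted):
    print(word[:3], '->', alphList[p])  # ideally e.g. COD -> E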

Question

The resulting learning curve looks as follows. Indeed, it shows an exponential decay.

However, it looks too wiggly to me for such a simple example, showing instabilities even in the late stages of learning.

plt.plot(range(total_epoch), lossHistory)
plt.show()

[Figure: loss history over the 100,000 training epochs]
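
One way (not part of the original post) to separate the overall decay from the wiggle is to plot the loss on a log scale with a moving average overlaid:

window   = 500
smoothed = np.convolve(lossHistory, np.ones(window) / window, mode='valid')

plt.semilogy(lossHistory, alpha=0.3, label='raw loss')
plt.semilogy(range(window - 1, total_epoch), smoothed, label='500-epoch moving average')
plt.legend()
plt.show()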


Possible explanations?

I think the learning curve should show a stable, square-like decay pattern, as one gets with the TensorFlow built-in functions (*). The instability might plausibly be explained by, for example, the lack of placeholders or the imbalanced dataset mentioned above, but I don't think either of these played a crucial role. Is there any other solution to help me out?


(*) I have seen a nearly square-patterned loss decay when using the TensorFlow built-in functions for a simple RNN. Sorry that I have not yet included those results for comparison, since I ran out of time; I will try to edit them in shortly.

Upvotes: 0

Views: 187

Answers (1)

Open Season

Reputation: 207

This modification, where the initial hidden state is set to zero, seems to solve the problem. In the original code, tf.random_normal builds an op that draws a fresh random initial state on every sess.run call, so each epoch starts from a different state and injects noise into the loss; a fixed zero initial state removes that source of randomness.

# Variables (separate version)
W_in  = tf.Variable(tf.random_normal([n_input, n_hidden]))
W_rec = tf.Variable(tf.random_normal([n_hidden, n_hidden]))
b_rec = tf.Variable(tf.random_normal([n_hidden]))
W_out = tf.Variable(tf.random_normal([n_hidden, n_output]))
b_out = tf.Variable(tf.random_normal([n_output]))
h_init = tf.zeros([1,n_hidden])

# Manual calculation of RNN output
def RNNoutput(Xinput):
    h_state = h_init  # fixed (zero) initial hidden state

    for iX in Xinput:
        h_state = tf.nn.tanh(iX @ W_in + (h_state @ W_rec + b_rec))

    rnn_output = h_state @ W_out + b_out
    return rnn_output
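
To see why this matters, here is a minimal demonstration (not from the original answer) that a tf.random_normal op produces a different value on every sess.run, whereas tf.zeros does not:

rand_init = tf.random_normal([1, 4])
zero_init = tf.zeros([1, 4])

with tf.Session() as demo_sess:
    print(demo_sess.run(rand_init))  # different values ...
    print(demo_sess.run(rand_init))  # ... on every run
    print(demo_sess.run(zero_init))  # always the same zeros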

Upvotes: 1
