Medita Inc.

Reputation: 57

TensorFlow NN implementation not converging

I'm trying to implement a simple feedforward neural network using only TensorFlow, and it is not converging. I'm not sure whether the problem is in the architecture of the network or in the training-procedure implementation. A simple two-layer NN built with Keras seems to converge fine:

import numpy as np
from keras.layers import Dense
from keras import Sequential

model = Sequential()
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(21, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(np.array(train_in), np.array(train_target), epochs=10, validation_split=0.1, batch_size=16)
Epoch 2/10
59717/59717 [==============================] - 4s 71us/sample - loss: 1.4021 - accuracy: 0.6812 - val_loss: 1.1049 - val_accuracy: 0.7066
Epoch 3/10
59717/59717 [==============================] - 4s 70us/sample - loss: 1.0942 - accuracy: 0.7321 - val_loss: 1.2269 - val_accuracy: 0.7015
Epoch 4/10
59717/59717 [==============================] - 4s 70us/sample - loss: 0.9096 - accuracy: 0.7654 - val_loss: 0.8207 - val_accuracy: 0.7905
Epoch 5/10
59717/59717 [==============================] - 4s 70us/sample - loss: 0.8373 - accuracy: 0.7790 - val_loss: 0.6863 - val_accuracy: 0.8267
Epoch 6/10
59717/59717 [==============================] - 4s 72us/sample - loss: 0.7925 - accuracy: 0.7918 - val_loss: 0.8132 - val_accuracy: 0.7929
Epoch 7/10
59717/59717 [==============================] - 4s 73us/sample - loss: 0.7916 - accuracy: 0.7925 - val_loss: 0.6749 - val_accuracy: 0.8210
Epoch 8/10
19600/59717 [========>.....................] - ETA: 2s - loss: 0.7475 - accuracy: 0.8011
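
For reference, the Keras run above leans on library defaults; as far as I know, Dense layers default to Glorot-uniform kernel initialization with zero biases, and the 'adam' string gives a learning rate of 0.001. A quick sanity check of those defaults (a sketch, not part of my training code):

from keras.layers import Dense
layer = Dense(32, activation='relu')
print(layer.kernel_initializer)  # GlorotUniform instance by default
print(layer.bias_initializer)    # Zeros instance by default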

Here's my implementation of the same network in TensorFlow:

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

batch_size = 10
hid_dim = 32
output_dim = 21
features = train_x.shape[1]

# Placeholders for one batch of inputs and integer class labels
x = tf.compat.v1.placeholder(tf.float32, (batch_size, features), name='x')
y = tf.compat.v1.placeholder(tf.int32, (batch_size,), name='y')

# Hidden-layer parameters
w1 = tf.Variable(tf.compat.v1.random_normal([features, hid_dim]), dtype=tf.float32)
b1 = tf.Variable(tf.compat.v1.random_normal([hid_dim]), dtype=tf.float32)

# Output-layer parameters
w2 = tf.Variable(tf.compat.v1.random_normal([hid_dim, output_dim]), dtype=tf.float32)
b2 = tf.Variable(tf.compat.v1.random_normal([output_dim]), dtype=tf.float32)

# Forward pass: one ReLU hidden layer, then raw logits
h1 = tf.nn.relu(tf.matmul(x, w1) + b1)
h2 = tf.matmul(h1, w2) + b2

# Softmax cross-entropy on the logits, averaged over the batch
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=h2, labels=y))
optimizer = tf.compat.v1.train.AdamOptimizer(0.001).minimize(loss)
pred = tf.nn.softmax(h2)
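
One difference I haven't ruled out: tf.compat.v1.random_normal draws weights with standard deviation 1.0, while (as far as I know) Keras Dense layers default to Glorot-uniform initialization, whose spread shrinks with layer width. A Glorot-style variant of the initialization above might look roughly like this (a sketch; the glorot_* names are ones I'm introducing here):

# Sketch: Glorot/Xavier-style initialization instead of unit-variance normals
init = tf.compat.v1.glorot_uniform_initializer()
glorot_w1 = tf.Variable(init([features, hid_dim]), dtype=tf.float32)
glorot_b1 = tf.Variable(tf.zeros([hid_dim]), dtype=tf.float32)
glorot_w2 = tf.Variable(init([hid_dim, output_dim]), dtype=tf.float32)
glorot_b2 = tf.Variable(tf.zeros([output_dim]), dtype=tf.float32)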

And here's my training-procedure implementation. In my case batch_size is fixed, so in each epoch I feed the whole dataset to the network batch by batch. I compute the loss after each batch and append it to an array; at the end of the epoch I take the mean of that array as the overall epoch loss:

train_in = np.array(train_x)

train_target = np.array(train_y)
train_target = np.squeeze(train_target)

num_of_train_batches = len(train_in) / batch_size
init = tf.compat.v1.global_variables_initializer()
print('TRAIN BATCHES: ', num_of_train_batches)

epoch_list = []
epoch_losses = []
epochs = 50

with tf.compat.v1.Session() as sess:
    sess.run(init)
    print('TRAINING')
    for epoch in range(epochs):
        train_losses = []
        print('EPOCH: ', epoch)
        epoch_list.append(epoch)

        # Run the whole training set through the network, batch by batch
        for it in range(int(num_of_train_batches)):
            start = it * batch_size
            end = start + batch_size
            # One optimizer step on the current batch; record its loss
            _, batch_loss = sess.run([optimizer, loss],
                                     feed_dict={x: train_in[start:end],
                                                y: train_target[start:end]})
            train_losses.append(batch_loss)

        # Epoch loss = mean of the per-batch losses
        epoch_losses.append(np.array(train_losses).mean())

        print('EPOCH: ', epoch)
        print('LOSS: ', np.array(train_losses).mean())

TRAIN BATCHES:  2200.0
TRAINING
EPOCH:  0
EPOCH:  0
LOSS:  1370.9271
EPOCH:  1
EPOCH:  1
LOSS:  64.23466
EPOCH:  2
EPOCH:  2
LOSS:  36.015495
EPOCH:  3
EPOCH:  3
LOSS:  30.292429
EPOCH:  4
EPOCH:  4
LOSS:  26.436918
EPOCH:  5
EPOCH:  5
LOSS:  25.689302
EPOCH:  6
EPOCH:  6
LOSS:  23.730627
EPOCH:  7
EPOCH:  7
LOSS:  22.356762
EPOCH:  8
EPOCH:  8
LOSS:  21.81124
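
Since the Keras log reports accuracy while my loop only reports loss, I could also track accuracy in the graph to make the comparison fairer; I believe something like this would work (a sketch using the pred and y tensors defined above):

# Sketch: fraction of correct predictions in the current batch
correct = tf.equal(tf.argmax(pred, 1, output_type=tf.int32), y)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))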

My Keras implementation reached a loss of about 0.75 after only 8 epochs, using the same number of hidden layers and the same hidden-layer size, but my TF implementation is still showing a loss greater than 10 even after 15 epochs.

Can someone please point out why this is happening? I'm guessing the problem has more to do with the training procedure than with the network itself.
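
One training-procedure difference I'm aware of: model.fit shuffles the training data every epoch by default, while my loop always visits the batches in the same fixed order. If that matters, a per-epoch reshuffle at the top of the epoch loop would look roughly like this (a sketch; perm is a name I'm introducing):

# Sketch: reshuffle the training set so each epoch sees batches in a new order
perm = np.random.permutation(len(train_in))
train_in = train_in[perm]
train_target = train_target[perm]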

All suggestions are welcome!

Upvotes: 0

Views: 61

Answers (0)
