Karnivaurus

Reputation: 24111

Different behaviour between same implementations of TensorFlow and Keras

I have TensorFlow 1.9 and Keras 2.0.8 on my machine. When training a neural network with some toy data, the resulting training curves are very different between TensorFlow and Keras, and I do not understand why.

For the Keras implementation, the network learns well and the loss continues to decrease, whereas for the TensorFlow implementation, the network does not learn anything and the loss does not decrease. I have tried to ensure that both implementations use the same hyperparameters. Why is the behaviour so different?

The network itself has two inputs: an image and a vector. Each is passed through its own layers before the two branches are concatenated.

Here are my implementations.

Tensorflow:

import tensorflow as tf

# Create the placeholders
input1 = tf.placeholder("float", [None, 64, 64, 3])
input2 = tf.placeholder("float", [None, 4])
label = tf.placeholder("float", [None, 4])

# Build the TensorFlow network
# Input 1
x1 = tf.layers.conv2d(inputs=input1, filters=30, kernel_size=[5, 5], strides=(2, 2), padding='valid', activation=tf.nn.relu)
x1 = tf.layers.conv2d(inputs=x1, filters=30, kernel_size=[5, 5], strides=(2, 2), padding='valid', activation=tf.nn.relu)
x1 = tf.layers.flatten(x1)
x1 = tf.layers.dense(inputs=x1, units=30)
# Input 2
x2 = tf.layers.dense(inputs=input2, units=30, activation=tf.nn.relu)
# Output
x3 = tf.concat(values=[x1, x2], axis=1)
x3 = tf.layers.dense(inputs=x3, units=30)
prediction = tf.layers.dense(inputs=x3, units=4)

# Define the optimisation
loss = tf.reduce_mean(tf.square(label - prediction))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

# Train the model
sess = tf.Session()
sess.run(tf.global_variables_initializer())
training_feed = {input1: training_input1_data, input2: training_input2_data, label: training_label_data}
validation_feed = {input1: validation_input1_data, input2: validation_input2_data, label: validation_label_data}
for epoch_num in range(30):
    train_loss, _ = sess.run([loss, train_op], feed_dict=training_feed)
    val_loss = sess.run(loss, feed_dict=validation_feed)

Keras:

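import keras
from keras import optimizers
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

# Build the Keras network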
# Input 1
input1 = Input(shape=(64, 64, 3), name='input1')
x1 = Conv2D(filters=30, kernel_size=5, strides=(2, 2), padding='valid', activation='relu')(input1)
x1 = Conv2D(filters=30, kernel_size=5, strides=(2, 2), padding='valid', activation='relu')(x1)
x1 = Flatten()(x1)
x1 = Dense(units=30, activation='relu')(x1)
# Input 2
input2 = Input(shape=(4,), name='input2')
x2 = Dense(units=30, activation='relu')(input2)
# Output
x3 = keras.layers.concatenate([x1, x2])
x3 = Dense(units=30, activation='relu')(x3)
prediction = Dense(units=4, activation='linear', name='output')(x3)

# Define the optimisation
model = Model(inputs=[input1, input2], outputs=[prediction])
adam = optimizers.Adam(lr=0.001)
model.compile(optimizer=adam, loss='mse')

# Train the model
training_inputs = {'input1': training_input1_data, 'input2': training_input2_data}
training_labels = {'output': training_label_data}
validation_inputs = {'input1': validation_images, 'input2': validation_state_diffs}
validation_labels = {'output': validation_label_data}
callback = PlotCallback()
model.fit(x=training_inputs, y=training_labels, validation_data=(validation_inputs, validation_labels), batch_size=len(training_label_data[0]), epochs=30)

And here are the training curves (two runs for each implementation).

Tensorflow:

[Image: TensorFlow training curves, run 1] [Image: TensorFlow training curves, run 2]

Keras:

[Image: Keras training curves, run 1] [Image: Keras training curves, run 2]

Upvotes: 3

Views: 974

Answers (2)

Sumsuddin Shojib

Reputation: 3743

I didn't notice any difference between your two implementations. Assuming there is none, here is what I think:

  • First, the two runs start at different initial losses, which suggests that the graphs are initialized differently. You didn't specify an initializer, so I looked into the documentation (tensorflow Conv2D, Keras Conv2D) and found that the default initializers are different.

    TensorFlow specifies no initializer, whereas Keras uses the Xavier initializer.

  • Second (this is my assumption), the TensorFlow loss decreases very sharply at first but then barely decreases compared to the Keras one. Since the network is neither very robust nor very deep, the bad initialization may have caused the TensorFlow model to fall into a local minimum.

  • Third, there may be some small differences between the two because default parameters can vary. Generally, wrapper frameworks try to choose sensible defaults so that fewer tweaks are needed to reach good weights.
    I have used the FastAI framework (based on PyTorch) and the Keras framework on the same classification problem with the same VGG network, and I got a significant improvement with FastAI, because its default parameters were recently tuned to the latest best practices.

Edit:

I failed to notice that the batch size was different, which is one of the most important hyperparameters here. @rvinas made this clear in his answer.
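As a small illustration of where the batch size comes from, the sketch below uses a hypothetical stand-in for training_label_data (6 samples, each a 4-dimensional label, matching the [None, 4] label placeholder in the question). The sample count of 6 is an assumption made only for this example. The expression len(training_label_data[0]) measures the width of one label, not the number of samples, and the same holds for a NumPy array of that shape.

```python
# Hypothetical toy labels: 6 samples, each a 4-dimensional target vector.
# The real dataset size is not given in the question.
training_label_data = [[0.0, 0.0, 0.0, 0.0] for _ in range(6)]

# Length of the outer container: the number of samples.
num_samples = len(training_label_data)

# Length of the first label: the label width, which is what the
# Keras call in the question passes as batch_size.
label_width = len(training_label_data[0])

print(num_samples, label_width)  # prints "6 4"
```

If the intent was full-batch training in Keras, passing len(training_label_data) as batch_size would presumably be the matching choice.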

Upvotes: 1

rvinas

Reputation: 11895

After carefully examining your implementations, I observed that all the hyperparameters match except for the batch size. I don't agree with the answer from @Ultraviolet, because the default kernel_initializer of tf.layers.conv2d is also Xavier (see the TF implementation of conv2d).

The learning curves don't match for the following two reasons:

  1. The parameters of the Keras implementation (version 2) receive many more updates than those of the TF implementation (version 1). In version 1, you feed the full dataset into the network at once in each epoch, which results in only 30 Adam updates. In contrast, version 2 performs 30 * ceil(len(training_label_data)/batch_size) Adam updates, with batch_size=4 (len(training_label_data[0]) is the width of one label, not the number of samples).

  2. The updates of version 2 are noisier than those of version 1, because the gradients are averaged over fewer samples.
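Both points can be checked with a minimal sketch in plain Python. The dataset size of 1000 and the synthetic scalar "gradients" are assumptions for illustration only; the question does not state either. The helper names count_updates and batch_mean_spread are hypothetical and not part of TensorFlow or Keras.

```python
import math
import random
import statistics


def count_updates(num_samples, batch_size, epochs):
    """Optimizer steps when every epoch is split into mini-batches."""
    return epochs * math.ceil(num_samples / batch_size)


def batch_mean_spread(per_sample_grads, batch_size, trials=200, seed=0):
    """Standard deviation of the batch-averaged gradient over random batches."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.sample(per_sample_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


# Assumed dataset size; not taken from the question.
NUM_SAMPLES = 1000
EPOCHS = 30

# Point 1: one update per epoch (full batch) vs. many (batch_size=4).
full_batch_updates = count_updates(NUM_SAMPLES, NUM_SAMPLES, EPOCHS)
mini_batch_updates = count_updates(NUM_SAMPLES, 4, EPOCHS)
print(full_batch_updates, mini_batch_updates)  # prints "30 7500"

# Point 2: synthetic per-sample gradients; averaging over fewer samples
# gives a noisier estimate of the mean gradient.
grad_rng = random.Random(42)
grads = [grad_rng.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]
spread_small = batch_mean_spread(grads, 4)
spread_full = batch_mean_spread(grads, NUM_SAMPLES)
print(spread_small > spread_full)
```

Under these assumptions the mini-batch run performs 250 times as many updates, and its batch-averaged gradient varies from batch to batch, while the full-batch average is the same every time.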

Upvotes: 3
