Reputation: 24111
I have TensorFlow 1.9 and Keras 2.0.8 on my machine. When training a neural network with some toy data, the resulting training curves are very different between TensorFlow and Keras, and I do not understand why.
For the Keras implementation, the network learns well and the loss continues to decrease, whereas for the TensorFlow implementation, the network does not learn anything and the loss does not decrease. I have tried to ensure that both implementations use the same hyperparameters. Why is the behaviour so different?
The network itself has two inputs: an image, and a vector. These are passed through their own layers before being concatenated.
Here are my implementations.
TensorFlow:
import tensorflow as tf

# Create the placeholders
input1 = tf.placeholder("float", [None, 64, 64, 3])
input2 = tf.placeholder("float", [None, 4])
label = tf.placeholder("float", [None, 4])
# Build the TensorFlow network
# Input 1
x1 = tf.layers.conv2d(inputs=input1, filters=30, kernel_size=[5, 5], strides=(2, 2), padding='valid', activation=tf.nn.relu)
x1 = tf.layers.conv2d(inputs=x1, filters=30, kernel_size=[5, 5], strides=(2, 2), padding='valid', activation=tf.nn.relu)
x1 = tf.layers.flatten(x1)
x1 = tf.layers.dense(inputs=x1, units=30, activation=tf.nn.relu)
# Input 2
x2 = tf.layers.dense(inputs=input2, units=30, activation=tf.nn.relu)
# Output
x3 = tf.concat(values=[x1, x2], axis=1)
x3 = tf.layers.dense(inputs=x3, units=30, activation=tf.nn.relu)
prediction = tf.layers.dense(inputs=x3, units=4)
# Define the optimisation
loss = tf.reduce_mean(tf.square(label - prediction))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
# Train the model
sess = tf.Session()
sess.run(tf.global_variables_initializer())
training_feed = {input1: training_input1_data, input2: training_input2_data, label: training_label_data}
validation_feed = {input1: validation_input1_data, input2: validation_input2_data, label: validation_label_data}
for epoch_num in range(30):
    train_loss, _ = sess.run([loss, train_op], feed_dict=training_feed)
    val_loss = sess.run(loss, feed_dict=validation_feed)
Keras:
import keras
from keras import optimizers
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.models import Model

# Build the Keras network
# Input 1
input1 = Input(shape=(64, 64, 3), name='input1')
x1 = Conv2D(filters=30, kernel_size=5, strides=(2, 2), padding='valid', activation='relu')(input1)
x1 = Conv2D(filters=30, kernel_size=5, strides=(2, 2), padding='valid', activation='relu')(x1)
x1 = Flatten()(x1)
x1 = Dense(units=30, activation='relu')(x1)
# Input 2
input2 = Input(shape=(4,), name='input2')
x2 = Dense(units=30, activation='relu')(input2)
# Output
x3 = keras.layers.concatenate([x1, x2])
x3 = Dense(units=30, activation='relu')(x3)
prediction = Dense(units=4, activation='linear', name='output')(x3)
# Define the optimisation
model = Model(inputs=[input1, input2], outputs=[prediction])
adam = optimizers.Adam(lr=0.001)
model.compile(optimizer=adam, loss='mse')
# Train the model
training_inputs = {'input1': training_input1_data, 'input2': training_input2_data}
training_labels = {'output': training_label_data}
validation_inputs = {'input1': validation_input1_data, 'input2': validation_input2_data}
validation_labels = {'output': validation_label_data}
callback = PlotCallback()  # custom callback defined elsewhere
model.fit(x=training_inputs, y=training_labels, validation_data=(validation_inputs, validation_labels), batch_size=len(training_label_data[0]), epochs=30, callbacks=[callback])
And here are the training curves (two runs for each implementation).
TensorFlow:
[training-curve plots: the loss barely decreases across both runs]
Keras:
[training-curve plots: the loss decreases steadily across both runs]
Upvotes: 3
Views: 974
Reputation: 3743
I didn't notice any difference between your two implementations. Assuming there is none, I think:
First, the two runs start at different initial losses, which suggests that the graphs are initialized differently. Since you didn't specify any initializer, I looked into the documentation (tensorflow Conv2D, Keras Conv2D) and found that the default initializers are different:
tensorflow uses no initializer, whereas Keras uses the Xavier initializer.
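Whatever the defaults turn out to be, one way to rule initialization out entirely (a minimal sketch reusing the variable names from your question; tf.glorot_uniform_initializer is available in TF 1.x, and 'glorot_uniform' is Keras's documented default) is to pin the same initializer explicitly on both sides:
# TensorFlow: pass the Glorot/Xavier initializer explicitly rather than relying on the default
x1 = tf.layers.conv2d(inputs=input1, filters=30, kernel_size=[5, 5], strides=(2, 2), padding='valid', activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
# Keras: 'glorot_uniform' is already the default, but it can be pinned explicitly too
x1 = Conv2D(filters=30, kernel_size=5, strides=(2, 2), padding='valid', activation='relu', kernel_initializer='glorot_uniform')(input1)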
Second (this is my assumption), the tensorflow loss decreases very sharply at the start but barely decreases afterwards, compared to the Keras one. Since the designed network is not very robust and not very deep, the bad initialization may have caused tensorflow to fall into a local minimum.
Thirdly, there may be some small differences between the two because the default parameters can vary. Generally, the wrapper frameworks try to handle some default parameters so that we need fewer tweaks to reach the optimal weights.
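For Adam specifically, one quick way to compare the defaults on both sides is shown below (a sketch, assuming a Python version where inspect.getargspec is still available):
import inspect
import tensorflow as tf
from keras import optimizers

# Keras optimizers expose their full configuration directly
print(optimizers.Adam(lr=0.001).get_config())
# The TF optimizer's defaults can be read off its constructor signature
print(inspect.getargspec(tf.train.AdamOptimizer.__init__))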
I have used the FastAI framework (based on pytorch) and the Keras framework for a certain classification problem with the same VGG network, and I got a significant improvement with FastAI, because its default parameters were recently tuned with the latest best practices.
I failed to notice that the batch size was different, which is one of the most important hyperparameters here. @rvinas made this clear in his answer.
Upvotes: 1
Reputation: 11895
After carefully examining your implementations, I observed that all the hyperparameters match except for the batch size. I don't agree with the answer from @Ultraviolet, because the default kernel_initializer of tf.layers.conv2d is also Xavier (see the TF implementation of conv2d).
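This is easy to check empirically. Below is a minimal sketch (assuming TF 1.x; the kernel lookup by variable name is only illustrative) that samples a default-initialized conv kernel and compares its range against the Glorot-uniform limit sqrt(6 / (fan_in + fan_out)):
import numpy as np
import tensorflow as tf

# Build a conv layer with the default kernel_initializer
inp = tf.placeholder("float", [None, 64, 64, 3])
_ = tf.layers.conv2d(inputs=inp, filters=30, kernel_size=[5, 5])
kernel = [v for v in tf.global_variables() if "kernel" in v.name][0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = sess.run(kernel)

fan_in = 5 * 5 * 3    # kernel height * kernel width * input channels
fan_out = 5 * 5 * 30  # kernel height * kernel width * filters
limit = np.sqrt(6.0 / (fan_in + fan_out))
print(values.min(), values.max(), "expected within +/-", limit)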
The learning curves don't match for the following two reasons:
The parameters of the Keras implementation (version 2) receive many more updates than those of the TF implementation (version 1). In version 1, you're feeding the full dataset into the network at each epoch, which results in only 30 Adam updates. In contrast, version 2 performs 30 * ceil(len(training_label_data) / batch_size) Adam updates, with batch_size=4 (since len(training_label_data[0]) is 4 for labels of shape (N, 4)).
The updates of version 2 are noisier than those of version 1, because the gradients are averaged over fewer samples.
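To make the difference in update counts concrete, here is a small sketch (N is a hypothetical dataset size, since the question doesn't state it):
import math

N = 1000           # hypothetical number of training examples
epochs = 30
batch_size = 4     # what len(training_label_data[0]) evaluates to for labels of shape (N, 4)

updates_v1 = epochs                                          # full-batch TF loop: one update per epoch
updates_v2 = epochs * int(math.ceil(N / float(batch_size)))  # Keras mini-batch training
print(updates_v1, updates_v2)                                # 30 vs. 7500
Passing batch_size=len(training_label_data) to model.fit (or mini-batching the TF loop) would make the two update counts comparable.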
Upvotes: 3