Reputation: 8005
I am using two different ways to implement the same type of model.
Method 1
# imports assumed (standalone Keras; use the tensorflow.keras equivalents if applicable)
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dense, Flatten

loss_function = 'mean_squared_error'
optimizer = 'Adagrad'
batch_size = 256
nr_of_epochs = 80
model= Sequential()
model.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model.add(MaxPooling1D(2))
model.add(Dense(300,activation='relu'))
model.add(Flatten())
model.add(Dense(1,activation='relu'))
print(model.summary())
model.compile(loss=loss_function, optimizer=optimizer,metrics=['mse','mae'])
history=model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_val,Y_val), shuffle=True, epochs=nr_of_epochs,verbose=2)
Method 2
# imports assumed, as in Method 1
from keras.layers import Input, Conv1D, MaxPooling1D, Dense, Flatten
from keras.models import Model

inputs = Input(shape=(64,1))
outX = Conv1D(60, 32, strides=1, activation='relu',padding='causal')(inputs)
outX = Conv1D(80, 10, activation='relu',padding='causal')(outX)
outX = Conv1D(100, 5, activation='relu',padding='causal')(outX)
outX = MaxPooling1D(2)(outX)
outX = Dense(300, activation='relu')(outX)
outX = Flatten()(outX)
predictions = Dense(1,activation='linear')(outX)
model = Model(inputs=[inputs],outputs=predictions)
print(model.summary())
model.compile(loss=loss_function, optimizer=optimizer,metrics=['mse','mae'])
history=model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_val,Y_val), shuffle=True,epochs=nr_of_epochs,verbose=2)
The model architectures of the two methods should be the same; please see the following images.
[Image: Method 1 model architecture]
[Image: Method 2 model architecture]
Even though their architectures are essentially the same, the training processes are very different when I feed them exactly the same data set. In the first implementation, the loss stops decreasing after just one epoch, while the second implementation shows a reasonable convergence trend in the training loss. Why is there such a large difference?
Method 1
625s - loss: 0.0670 - mean_squared_error: 0.0670 - mean_absolute_error: 0.0647 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Epoch 2/120
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Epoch 3/120
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Epoch 4/120
625s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Epoch 5/120
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Epoch 6/120
622s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646
Method 2
429s - loss: 0.0623 - mean_squared_error: 0.0623 - mean_absolute_error: 0.1013 - val_loss: 0.0505 - val_mean_squared_error: 0.0505 - val_mean_absolute_error: 0.1006
Epoch 2/80
429s - loss: 0.0507 - mean_squared_error: 0.0507 - mean_absolute_error: 0.0977 - val_loss: 0.0504 - val_mean_squared_error: 0.0504 - val_mean_absolute_error: 0.0988
Epoch 3/80
429s - loss: 0.0503 - mean_squared_error: 0.0503 - mean_absolute_error: 0.0964 - val_loss: 0.0498 - val_mean_squared_error: 0.0498 - val_mean_absolute_error: 0.0954
Epoch 4/80
428s - loss: 0.0501 - mean_squared_error: 0.0501 - mean_absolute_error: 0.0955 - val_loss: 0.0498 - val_mean_squared_error: 0.0498 - val_mean_absolute_error: 0.0962
Epoch 5/80
429s - loss: 0.0499 - mean_squared_error: 0.0499 - mean_absolute_error: 0.0951 - val_loss: 0.0501 - val_mean_squared_error: 0.0501 - val_mean_absolute_error: 0.0960
Epoch 6/80
430s - loss: 0.0498 - mean_squared_error: 0.0498 - mean_absolute_error: 0.0947 - val_loss: 0.0495 - val_mean_squared_error: 0.0495 - val_mean_absolute_error: 0.0941
Upvotes: 0
Views: 59
Reputation: 86600
The activation in the last layer is different: 'relu' vs. 'linear'.
This alone produces results that are very different. (The relu one will never produce negative results).
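For a concrete check, here is a minimal sketch of Method 1 with only the final activation switched to 'linear' so it matches Method 2 (imports are assumed to be the standalone keras package; adjust to tensorflow.keras if that is what you use):

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dense, Flatten

model = Sequential()
model.add(Conv1D(60, 32, strides=1, activation='relu', padding='causal', input_shape=(64, 1)))
model.add(Conv1D(80, 10, strides=1, activation='relu', padding='causal'))
model.add(Conv1D(100, 5, strides=1, activation='relu', padding='causal'))
model.add(MaxPooling1D(2))
model.add(Dense(300, activation='relu'))
model.add(Flatten())
model.add(Dense(1, activation='linear'))  # the only change: 'linear' instead of 'relu'
model.compile(loss='mean_squared_error', optimizer='Adagrad', metrics=['mse', 'mae'])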
Also, there is a lot of "luck" involved, especially when using "relu" throughout the model. Weights in each model are initialized randomly, so the two models are not "the same" (unless you force the weights from one into the other using model.get_weights() and model.set_weights()). And "relu" is an activation that must be used with care: learning rates that are too big may quickly drive all outputs to zero, stopping learning before the model has really learned anything.
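As a sketch of the weight-copying idea (the names model_1 and model_2 are placeholders for the Sequential and functional models built above): the two architectures have weight tensors of identical shapes, so you can start both from the same initialization and remove that source of randomness from the comparison:

# Copy the freshly initialized weights of Method 2 into Method 1
initial_weights = model_2.get_weights()   # list of NumPy arrays, one per weight tensor
model_1.set_weights(initial_weights)      # raises an error if any shape does not match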
Is this a binary classification model? If so, use "sigmoid" in the last layer.
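If it is binary classification, a sketch of the output part of Method 2 with a sigmoid head and the matching loss would look like this:

predictions = Dense(1, activation='sigmoid')(outX)   # probabilities in (0, 1)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(loss='binary_crossentropy', optimizer='Adagrad', metrics=['accuracy'])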
Upvotes: 2