user297850

Reputation: 8005

Huge difference between two different implementations of the same model

I am using two different ways to implement the same type of model.

Method 1

loss_function = 'mean_squared_error'
optimizer = 'Adagrad'
batch_size = 256
nr_of_epochs = 80

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dense, Flatten

model = Sequential()
model.add(Conv1D(60,32, strides=1, activation='relu',padding='causal',input_shape=(64,1)))
model.add(Conv1D(80,10, strides=1, activation='relu',padding='causal'))
model.add(Conv1D(100,5, strides=1, activation='relu',padding='causal'))
model.add(MaxPooling1D(2))
model.add(Dense(300,activation='relu'))
model.add(Flatten())
model.add(Dense(1,activation='relu'))
print(model.summary())

model.compile(loss=loss_function, optimizer=optimizer,metrics=['mse','mae'])
history=model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_val,Y_val), shuffle=True, epochs=nr_of_epochs,verbose=2)  

Method 2

from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Dense, Flatten

inputs = Input(shape=(64,1))
outX = Conv1D(60, 32, strides=1, activation='relu',padding='causal')(inputs)
outX = Conv1D(80, 10, activation='relu',padding='causal')(outX)
outX = Conv1D(100, 5, activation='relu',padding='causal')(outX)
outX = MaxPooling1D(2)(outX)
outX = Dense(300, activation='relu')(outX)
outX = Flatten()(outX)
predictions = Dense(1,activation='linear')(outX)
model = Model(inputs=[inputs],outputs=predictions)
print(model.summary())

model.compile(loss=loss_function, optimizer=optimizer,metrics=['mse','mae'])
history=model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_val,Y_val), shuffle=True,epochs=nr_of_epochs,verbose=2)   

The model architecture of both methods should be the same; please see the model summaries in the images below.

Method 1

[model summary image]

Method 2

[model summary image]

Even though their architectures are essentially the same, the training processes are very different when I feed them exactly the same data set. In the first implementation, the loss stops decreasing after just one epoch, while the second implementation shows a reasonable convergence trend in the training loss. Why is there such a large difference?

Method 1

625s - loss: 0.0670 - mean_squared_error: 0.0670 - mean_absolute_error: 0.0647 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                                                                                                                                  
Epoch 2/120                                                                                                                                        
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                                                                                                                                  
Epoch 3/120                                                                                                                                        
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                                                                                                                                  
Epoch 4/120                                                                                                                                        
625s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                                                                                                                                  
Epoch 5/120                                                                                                                                        
624s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                                                                                                                                  
Epoch 6/120                                                                                                                                        
622s - loss: 0.0647 - mean_squared_error: 0.0647 - mean_absolute_error: 0.0641 - val_loss: 0.0653 - val_mean_squared_error: 0.0653 - val_mean_absolute_error: 0.0646                       

Method 2

429s - loss: 0.0623 - mean_squared_error: 0.0623 - mean_absolute_error: 0.1013 - val_loss: 0.0505 - val_mean_squared_error: 0.0505 - val_mean_absolute_error: 0.1006                                                                                                                                  
Epoch 2/80                                                                                                                                         
429s - loss: 0.0507 - mean_squared_error: 0.0507 - mean_absolute_error: 0.0977 - val_loss: 0.0504 - val_mean_squared_error: 0.0504 - val_mean_absolute_error: 0.0988                                                                                                                                  
Epoch 3/80                                                                                                                                         
429s - loss: 0.0503 - mean_squared_error: 0.0503 - mean_absolute_error: 0.0964 - val_loss: 0.0498 - val_mean_squared_error: 0.0498 - val_mean_absolute_error: 0.0954                                                                                                                                  
Epoch 4/80                                                                                                                                         
428s - loss: 0.0501 - mean_squared_error: 0.0501 - mean_absolute_error: 0.0955 - val_loss: 0.0498 - val_mean_squared_error: 0.0498 - val_mean_absolute_error: 0.0962                                                                                                                                  
Epoch 5/80                                                                                                                                         
429s - loss: 0.0499 - mean_squared_error: 0.0499 - mean_absolute_error: 0.0951 - val_loss: 0.0501 - val_mean_squared_error: 0.0501 - val_mean_absolute_error: 0.0960                                                                                                                                  
Epoch 6/80                                                                                                                                         
430s - loss: 0.0498 - mean_squared_error: 0.0498 - mean_absolute_error: 0.0947 - val_loss: 0.0495 - val_mean_squared_error: 0.0495 - val_mean_absolute_error: 0.0941           

Upvotes: 0

Views: 59

Answers (1)

Daniel Möller

Reputation: 86600

The activation in the last layer is different: 'relu' vs. 'linear'.

This alone produces very different results. (The 'relu' version will never produce negative outputs.)
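If you want the two implementations to behave the same, a minimal sketch of the fix (assuming the rest of your setup stays as posted) is to give the Sequential model the same 'linear' output as the functional one:

# Last layer of Method 1, changed to match Method 2:
# 'linear' lets the regression output take negative values as well.
model.add(Dense(1, activation='linear'))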

Also, there is a lot of "luck" involved, especially when "relu" is used throughout the entire model.

Weights in each model are initialized randomly, so they're not "the same" (unless you force the weights from one model into the other using model.get_weights() and model.set_weights()). And "relu" is an activation that must be used with care: a learning rate that is too big may quickly drive all outputs to zero, stopping learning before the model has really learned anything.
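For example, to rule out initialization luck you could start both models from exactly the same weights before training. A minimal sketch, where model_1 and model_2 are just placeholder names for your two models (both are simply called model in the question) and their layer shapes match exactly, as they do here:

# Copy the initial weights of one model into the other so both
# start training from the same point; requires identical layer shapes.
model_2.set_weights(model_1.get_weights())

Similarly, if you suspect the 'relu' units are dying early, one thing to try (not part of the original post) is an explicit, smaller learning rate instead of the optimizer's string name:

from keras.optimizers import Adagrad
# Adagrad's default learning rate in Keras is 0.01; 0.001 here is only
# an illustrative value, not a recommendation from the question.
optimizer = Adagrad(lr=0.001)
model.compile(loss=loss_function, optimizer=optimizer, metrics=['mse','mae'])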


Is this a binary classification model? If so, use "sigmoid" in the last layer.
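For instance, if the target is a 0/1 label, the last layer of the Method 2 code would become something like the sketch below:

# Sigmoid squashes the output into (0, 1), which suits a binary target;
# it is typically paired with a 'binary_crossentropy' loss instead of MSE.
predictions = Dense(1, activation='sigmoid')(outX)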

Upvotes: 2
