Reputation: 379
I'm using keras-bert for classification. On some datasets it runs well and computes the loss, while on others the loss is NaN.
The datasets are similar in that they are all augmented versions of the same original one. With keras-bert, the original data and some of the augmented versions run fine, while other augmented versions don't.
When I train a regular one-layer BiLSTM on the augmented versions that fail with keras-bert, it works out fine, which lets me rule out the data being faulty or containing spurious values that could affect how the loss is calculated.
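For reference, the baseline looks roughly like this (the vocabulary size, sequence length, and layer sizes below are placeholders, not my exact configuration):

import tensorflow as tf

# Placeholder hyper-parameters; the real values depend on the tokenizer.
VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 128, 100

baseline = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=MAX_LEN),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(3, activation='softmax'),
])
baseline.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['sparse_categorical_accuracy'])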
The data I'm working with has three classes. I'm using BERT-base uncased:
!wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Can anyone give me pointers as to why the loss is NaN?
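The model in the snippet below is the pretrained checkpoint loaded with keras-bert, along these lines (SEQ_LEN is a placeholder):

from keras_bert import load_trained_model_from_checkpoint

# Paths follow the unzipped archive above; SEQ_LEN is a placeholder.
config_path = 'uncased_L-12_H-768_A-12/bert_config.json'
checkpoint_path = 'uncased_L-12_H-768_A-12/bert_model.ckpt'

model = load_trained_model_from_checkpoint(
    config_path,
    checkpoint_path,
    training=True,    # keep the training head so model.layers[-3] exists
    trainable=True,
    seq_len=SEQ_LEN,
)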
inputs = model.inputs[:2]
dense = model.layers[-3].output
outputs = keras.layers.Dense(
    3, activation='sigmoid',
    kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
    name='real_output')(dense)

decay_steps, warmup_steps = calc_train_steps(
    train_y.shape[0], batch_size=BATCH_SIZE, epochs=EPOCHS)

model = keras.models.Model(inputs, outputs)
model.compile(
    AdamWarmup(decay_steps=decay_steps, warmup_steps=warmup_steps, lr=LR),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'])

# Initialise any variables the checkpoint did not cover.
sess = tf.compat.v1.keras.backend.get_session()
uninitialized_variables = set(
    i.decode('ascii')
    for i in sess.run(tf.compat.v1.report_uninitialized_variables()))
init_op = tf.compat.v1.variables_initializer(
    [v for v in tf.compat.v1.global_variables()
     if v.name.split(':')[0] in uninitialized_variables])
sess.run(init_op)
model.fit(train_x, train_y, epochs=EPOCHS, batch_size=BATCH_SIZE)
Train on 20342 samples
Epoch 1/10
20342/20342 [==============================] - 239s 12ms/sample - loss: nan - sparse_categorical_accuracy: 0.5572
Epoch 2/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 3/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2081
Epoch 4/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 5/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 6/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 7/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 8/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2081
Epoch 9/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
Epoch 10/10
20342/20342 [==============================] - 225s 11ms/sample - loss: nan - sparse_categorical_accuracy: 0.2082
<tensorflow.python.keras.callbacks.History at 0x7f1caf9b0f90>
Also, I'm running this on Google Colab with TensorFlow 2.3.0 and Keras 2.4.3.
UPDATE
I looked at the data that was causing this issue again and realised that one of the target labels was missing; I might have mistakenly edited it out. Once I fixed it, the NaN-loss problem disappeared. However, I'll still be awarding the 50 points to the answer I got, because it got me to think more carefully about my code. Thanks.
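In hindsight, a quick sanity check on the targets would have caught this; something like the following (assuming integer labels in train_y):

import numpy as np

# Every target should be an integer in [0, 3) and every class present;
# an out-of-range or missing label can silently turn the loss into NaN.
labels = np.asarray(train_y).ravel()
print("unique labels:", np.unique(labels))
assert labels.min() >= 0 and labels.max() < 3, "label outside [0, 3)"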
Upvotes: 1
Views: 651
Reputation: 17219
I noticed one issue in your code, but I'm not sure it's the main cause; it would help if you could provide some reproducible code.
In your code snippet, you use sigmoid as the last-layer activation with more than one unit, which suggests the problem dataset is multi-label; in that case the loss function should be binary_crossentropy. But you set sparse_categorical_crossentropy, which is typically used for multi-class problems with integer labels.
outputs = keras.layers.Dense(3, activation='sigmoid',
kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
name = 'real_output')(dense)
model = keras.models.Model(inputs, outputs)
model.compile(AdamWarmup(decay_steps=decay_steps,
warmup_steps=warmup_steps, lr=LR),
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'])
So, if your dataset is multi-label with 3 units in the last layer, the set-up should look more like this:
outputs = keras.layers.Dense(3, activation='sigmoid',
kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
name = 'real_output')(dense)
model.compile(AdamWarmup(decay_steps=decay_steps,
warmup_steps=warmup_steps, lr=LR),
loss='binary_crossentropy',
metrics=['accuracy'])
But if it is a multi-class problem and your target labels are integers (still 3 units), then the set-up should look more like this:
outputs = keras.layers.Dense(3, activation='softmax',
kernel_initializer=keras.initializers.TruncatedNormal(stddev=0.02),
name = 'real_output')(dense)
model.compile(AdamWarmup(decay_steps=decay_steps,
warmup_steps=warmup_steps, lr=LR),
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'])
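And if your targets were one-hot encoded rather than integer class ids, the matching multi-class loss would be categorical_crossentropy with the same softmax output, e.g.:

# One-hot targets: same softmax head, but categorical_crossentropy.
model.compile(AdamWarmup(decay_steps=decay_steps,
                         warmup_steps=warmup_steps, lr=LR),
              loss='categorical_crossentropy',
              metrics=['accuracy'])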
Upvotes: 2