Ire21

Reputation: 59

NaN loss in Keras with triplet loss

I'm trying to learn an embedding for Paris6k images by combining VGG and Adrian Ung's triplet loss. The problem is that after a small number of iterations in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy jump to 1.

I've already tried lowering the learning rate, increasing the batch size (only to 16 because of memory), changing the optimizer (Adam and RMSprop), checking for None values in my dataset, changing the data format from 'float32' to 'float64', adding a small bias to the data, and simplifying the model.

Here is my code:

# Imports needed to run this snippet
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import triplet_loss as tl  # Adrian Ung's triplet loss module (placeholder import name)

base_model = VGG16(include_top=False, input_shape=(512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')

embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])

model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)

batch_size = 16
epochs = 2
embedding_size = 64

opt = Adam(lr=0.0001)

model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])

# image_list and label_list are built earlier (loading and labelling omitted here)
label_list = np.vstack(label_list)

x_train = image_list[:2500]
x_val = image_list[2500:]

y_train = label_list[:2500]
y_val = label_list[2500:]

dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))

H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)  # callbacks_list is defined elsewhere

There are 3366 images, with values scaled to the range [0, 1]. The network takes dummy ground-truth values because it tries to learn embeddings such that images of the same class have a small distance while images of different classes have a large distance; the real class labels are fed in as an input and concatenated with the embeddings for the loss.
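
For context, my understanding of the adapter is that it ignores the dummy y_true and splits y_pred back into the label column and the embedding. Here is a simplified sketch of that idea (a batch-all variant I wrote for illustration, not the semi-hard code from the notebook):

import tensorflow as tf

def triplet_loss_sketch(y_true, y_pred, margin=1.0):
    # y_true is the dummy zero array and is ignored;
    # each row of y_pred is [label, embedding...].
    labels = y_pred[:, :1]
    embeddings = y_pred[:, 1:]
    # Pairwise squared Euclidean distances between all embeddings in the batch.
    dots = tf.matmul(embeddings, embeddings, transpose_b=True)
    sq_norms = tf.linalg.diag_part(dots)
    dists = tf.expand_dims(sq_norms, 1) - 2.0 * dots + tf.expand_dims(sq_norms, 0)
    # Masks: positives share a label (excluding self), negatives have different labels.
    same = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)
    pos_mask = same - tf.eye(tf.shape(same)[0])
    neg_mask = 1.0 - same
    # Batch-all triplet loss: max(d(a,p) - d(a,n) + margin, 0) over valid (a,p,n) triplets.
    triplets = tf.expand_dims(dists, 2) - tf.expand_dims(dists, 1) + margin
    valid = tf.expand_dims(pos_mask, 2) * tf.expand_dims(neg_mask, 1)
    triplets = tf.maximum(triplets * valid, 0.0)
    num_active = tf.reduce_sum(tf.cast(triplets > 1e-16, tf.float32))
    # The epsilon keeps the division finite when no triplet is active.
    return tf.reduce_sum(triplets) / (num_active + 1e-16)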

I've noticed that when I was previously using an incorrect class division (and keeping images that should have been discarded), I didn't have the NaN loss problem.

What should I try to do?

Thanks in advance, and sorry for my English.

Upvotes: 3

Views: 2079

Answers (3)

ma7555

Reputation: 400

I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.

https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)
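
The idea, roughly, is to draw a few classes per batch and several images from each, so positive pairs can never be missing. A minimal sketch of that idea (this is not kerasgen's API; the function and parameter names here are made up for illustration):

import numpy as np

def balanced_batches(image_list, label_list, classes_per_batch=4, samples_per_class=4):
    # Group image indices by class, then repeatedly sample a few classes
    # and a few images per class, so every batch contains positive pairs.
    labels = np.asarray(label_list).ravel()
    by_class = {c: np.where(labels == c)[0] for c in np.unique(labels)}
    while True:
        chosen = np.random.choice(list(by_class), classes_per_batch, replace=False)
        idx = np.concatenate([np.random.choice(by_class[c], samples_per_class)
                              for c in chosen])
        yield image_list[idx], labels[idx]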

Upvotes: 1

Akash Kumar

Reputation: 483

Try increasing your batch size. It happened to me too: as explained in the other answers, the network is unable to find any positive pairs (num_positives is zero). I had 250 classes and was getting NaN loss initially; I increased the batch size to 128/256 and then there was no issue.

I saw that Paris6k has 15 (or 12) classes. Increase your batch size to 32, and if you run out of GPU memory, try a model with fewer parameters. You can start with EfficientNetB0, which has 5.3M parameters compared to VGG16's 138M, as sketched below.
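
Swapping the backbone is essentially a one-line change (assuming a TF version recent enough to ship EfficientNetB0 in tf.keras.applications):

from tensorflow.keras.applications import EfficientNetB0

# Drop-in replacement for the VGG16 backbone; the rest of the pipeline
# (Flatten, label concatenation, triplet loss) stays the same,
# but the smaller model leaves memory for a larger batch size.
base_model = EfficientNetB0(include_top=False, input_shape=(512, 384, 3))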

Upvotes: 5

Louis T.

Reputation: 81

In some cases, a seemingly random NaN loss can be caused by your data: if there are no positive pairs in your batch, you will get a NaN loss.

As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it's the same code):

semi_hard_triplet_loss_distance = math_ops.truediv(
        math_ops.reduce_sum(
            math_ops.maximum(
                math_ops.multiply(loss_mat, mask_positives), 0.0)),
        num_positives,
        name='triplet_semihard_loss')

There is a division by the number of positive pairs (num_positives), which leads to NaN when a batch contains none.

I suggest you inspect your data pipeline to ensure there is at least one positive pair in each of your batches. (You can, for example, adapt some of the code in triplet_loss_adapted_from_tf to get the num_positives of your batch and check that it is greater than 0.)
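
A quick sanity check along these lines, reusing the y_train and batch_size names from your code (note that model.fit shuffles by default, so this only checks the unshuffled ordering, but it shows the idea):

import numpy as np

def has_positive_pair(batch_labels):
    # A batch contains a positive pair iff some label appears at least twice.
    _, counts = np.unique(np.asarray(batch_labels).ravel(), return_counts=True)
    return bool((counts >= 2).any())

for start in range(0, len(y_train), batch_size):
    if not has_positive_pair(y_train[start:start + batch_size]):
        print('Batch starting at', start, 'has no positive pairs -> NaN risk')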

Upvotes: 5
