dunky11

Reputation: 99

How is the transformer's loss calculated for blank token predictions?

I'm currently trying to implement a transformer and have trouble understanding its loss calculation.

For batch_size=1 and max_sentence_length=8, my encoder's input looks like:

[[Das, Wetter, ist, gut, <blank>, <blank>, <blank>, <blank>]]

My decoder's input (German to English) looks like:

[[<start>, The, weather, is, good, <end>, <blank>, <blank>]]

Let's say my transformer predicted these class probabilities (showing only the word of the highest-probability class at each position):

[[The, good, is, weather, <end>, <blank>, <blank>, <blank>]]

Now I calculate the loss using:

loss = categorical_crossentropy(
   [[The, good, is, weather, <end>, <blank>, <blank>, <blank>]],
   [[The, weather, is, good, <end>, <blank>, <blank>, <blank>]]
)

Is this the correct way to calculate the loss? My transformer always predicts the blank token as the next word, and I suspect that's because there is a mistake in my loss calculation and I need to handle the blank tokens before computing the loss.

Upvotes: 4

Views: 1955

Answers (2)

Arun prakash

Reputation: 11

When using a framework like PyTorch, you can set

ignore_index=0

while computing the cross-entropy loss with torch.nn.CrossEntropyLoss or torch.nn.functional.cross_entropy. Here I have assumed that the index of the pad token is 0.
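
For example, a minimal sketch (the vocabulary size, shapes, and token indices below are made up for illustration; ignore_index=0 is the actual point):

import torch
import torch.nn.functional as F

# Toy setup: vocabulary of 6 tokens, <pad> assumed to sit at index 0.
vocab_size = 6
logits = torch.randn(1, 8, vocab_size)             # (batch, seq_len, vocab)
target = torch.tensor([[3, 1, 4, 2, 5, 0, 0, 0]])  # trailing 0s = <pad>

# cross_entropy expects (N, C) logits and (N,) targets, so flatten first.
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    target.view(-1),
    ignore_index=0,  # targets equal to 0 contribute nothing to the loss
)

With ignore_index set, the default mean reduction also divides only by the number of non-ignored targets, so padded positions neither add loss nor dilute the average.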

Upvotes: 1

Jindřich

Reputation: 11213

You need to mask out the padding. (What you call <blank> is more often called <pad>.)

  • Create a mask saying where the valid tokens are (pseudocode: mask = target != '<pad>')

  • When computing the categorical cross-entropy, do not reduce the loss automatically; keep the per-position values.

  • Multiply the loss values with the mask, i.e., positions corresponding to the <blank> tokens get zeroed out, and sum the losses at the valid positions (pseudocode: loss_sum = (loss * mask).sum()).

  • Divide loss_sum by the number of valid positions, i.e., the sum of the mask (pseudocode: loss = loss_sum / mask.sum()). A sketch of all four steps follows after this list.
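
Putting these steps together, a minimal PyTorch sketch (the pad index, vocabulary size, and tensor shapes are made-up assumptions for illustration):

import torch
import torch.nn.functional as F

pad_index = 0  # assumption: <pad>/<blank> is mapped to index 0
vocab_size = 6
logits = torch.randn(1, 8, vocab_size)             # (batch, seq_len, vocab)
target = torch.tensor([[3, 1, 4, 2, 5, 0, 0, 0]])  # trailing 0s = <pad>

# Step 1: mask of valid (non-pad) positions.
mask = (target != pad_index).float()

# Step 2: per-position cross-entropy, no reduction.
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    target.view(-1),
    reduction="none",
).view_as(mask)

# Steps 3 and 4: zero out pad positions, then average over the valid ones.
loss = (loss * mask).sum() / mask.sum()

Dividing by mask.sum() rather than the full sequence length keeps the loss scale independent of how much padding a batch happens to contain.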

Upvotes: 5
