Reputation: 5771
Masking in BERT
I understand why masking is needed. I also understand why the mask is sometimes replaced with the original token: otherwise, the model would learn to completely ignore the word itself when deriving its contextual embeddings for downstream tasks.
Why is the mask replaced with a random word 10% of the time?
Upvotes: 1
Views: 633
Reputation: 11
The main purpose is to prevent overfitting, i.e. to keep the model from memorizing the training data. You can also regard the random replacement (applied to 10% of the masked positions) as noise added to improve the robustness of learning; the noise stands in for the imperfections of real-world input.
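To make the 80/10/10 split concrete, here is a minimal sketch of BERT's masked-LM corruption step in plain Python. The mask id, vocabulary size, and the -100 "ignore" label are placeholders; substitute the values used by whatever tokenizer and loss convention you are working with.

```python
import random

MASK_ID = 103        # assumed [MASK] token id; depends on the tokenizer
VOCAB_SIZE = 30522   # assumed vocabulary size; depends on the tokenizer

def corrupt_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); label is -100 at positions the loss ignores."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:      # position selected for prediction
            labels.append(tok)               # model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)        # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(-100)              # not selected, no prediction required
    return corrupted, labels
```

At training time the loss is computed only at the labeled positions, so the corrupted sequence is what the model actually sees as input.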
Upvotes: 1