Reputation: 5771
Masking in BERT
I understand why masking is needed. I also understand why the mask is sometimes replaced with the original token: otherwise, the model would learn to completely ignore the word itself when deriving its contextual embeddings for downstream tasks.
Why is the mask replaced with a random word 10% of the time?
Upvotes: 1
Views: 633
Reputation: 11
The main purpose is to prevent overfitting, i.e. to keep the model from memorizing the training data. You can also regard the random replacement (applied to 10% of the masked positions) as noise added to improve the robustness of learning; the noise stands in for the imperfections of real-world input.
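To make the 80/10/10 split concrete, here is a minimal sketch of BERT's masked-LM corruption step in plain Python. The mask id, vocabulary size, and the -100 "ignore" label are placeholders; substitute the values used by whatever tokenizer and loss convention you are working with.

```python
import random

MASK_ID = 103        # assumed [MASK] token id; depends on the tokenizer
VOCAB_SIZE = 30522   # assumed vocabulary size; depends on the tokenizer

def corrupt_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); label is -100 at positions the loss ignores."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:      # position selected for prediction
            labels.append(tok)               # model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)    # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)        # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(-100)              # not selected, no prediction required
    return corrupted, labels
```

At training time the loss is computed only at the labeled positions, so the corrupted sequence is what the model actually sees as input.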
Upvotes: 1