NathanaëlBeau

Reputation: 1152

Masked language model processing, deeper explanation

I'm looking at the BERT model (you can find the description here) in detail, and I'm having trouble understanding why, for the masked language model, the selected tokens are kept unchanged or replaced with a random word 20% of the time instead of always being replaced with the [MASK] token.

We train with the bidirectional technique, and the article explains that the "[MASK] token is never seen during fine-tuning", but these are two different steps to me: first we do the bidirectional pre-training, and afterwards we train on the downstream task.

Could someone explain where my understanding is wrong?

Upvotes: 0

Views: 1077

Answers (1)

Separius

Reputation: 1296

If you don't use random replacement (and keeping) during training, your network won't learn to extract useful features from the non-masked tokens.

In other words, if you only mask tokens and try to predict them, it is a waste of resources for your network to extract good features for the non-masked tokens (remember that your network is only as good as your task, and it will try to find the easiest way to solve it).
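Just to make the procedure concrete, here is a minimal sketch of the 80/10/10 rule described in the paper; the toy vocabulary, example sentence, and function name are mine, not from the paper or any library:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """BERT-style masking: of the ~15% selected positions,
    80% become [MASK], 10% become a random word, 10% stay unchanged.
    The model must predict the original token at every selected position."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < select_prob:        # position chosen as a prediction target
            labels[i] = token                     # the model must recover the original word
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the original token unchanged
    return inputs, labels

tokens = "my dog is hairy".split()
print(mask_tokens(tokens, vocab=["cat", "apple", "runs", "blue"]))
```

Because the model never knows whether the word it sees at a given position is the original, a random replacement, or hidden behind [MASK], it has to build a good contextual representation for every token, which is exactly what you want at fine-tuning time when no [MASK] tokens appear.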

Upvotes: 1
