Reputation: 581
I'm looking for suggestions on using BERT and BERT's masked language model to predict multiple tokens.
My data looks like:
context: some very long context paragraph
question: rainy days lead to @placeholder
The answer for this @placeholder is wet weather. In the model, wet weather is the answer to predict.
So at the pre-processing stage, should I change the text into rainy days lead to [MASK], or something like rainy days lead to [MASK] [MASK]? I know that the masked LM works well on single-token prediction; do you think the masked LM can work well on multi-token prediction? If not, do you have any suggestions on how to pre-process and train this kind of data?
Thanks so much!
Upvotes: 5
Views: 4609
Reputation: 1080
So there are 3 questions:
First,
So at the pre-processing stage, should I change the text into rainy days lead to [MASK] or something like rainy days lead to [MASK] [MASK]?
From a word point of view, you should set [MASK] [MASK]. But remember that in BERT, masking is done at the token level. In fact, 'wet weather' may be tokenized into something like [wet] [weath] [##er], and in that case you should have [MASK] [MASK] [MASK]. So: one [MASK] per token.
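As a minimal sketch of that idea (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is named in the question), you can tokenize the answer first and then insert exactly one [MASK] per resulting token:

```python
# Sketch: one [MASK] per *token* of the answer, not per word.
# Assumptions: Hugging Face `transformers`, `bert-base-uncased` checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

answer = "wet weather"
answer_tokens = tokenizer.tokenize(answer)            # e.g. ['wet', 'weather'] or sub-word pieces
mask_string = " ".join([tokenizer.mask_token] * len(answer_tokens))

text = "rainy days lead to " + mask_string            # one [MASK] per answer token
print(answer_tokens)
print(text)
```

The number of [MASK] placeholders therefore depends on the tokenizer's vocabulary, not on how many words the answer contains.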
Second,
I know that the masked LM works well on the single token prediction, do you think the masked LM can work well on the multiple tokens prediction?
As you can read in the original paper, they said:
The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time.
They mention no limit on the number of masked tokens per sentence, and you have several masked tokens during BERT pre-training. In my own experience, I pre-trained BERT several times and noticed almost no difference between the predictions made for masked tokens whether there was only one or several masked tokens in my input.
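As an illustration (again assuming Hugging Face transformers and bert-base-uncased, which the answer does not name), the masked LM head produces one prediction per [MASK] position in a single forward pass, so several masks are handled the same way as one:

```python
# Sketch: predicting several [MASK] tokens in one forward pass.
# Assumptions: Hugging Face `transformers`, `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "rainy days lead to [MASK] [MASK]"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                   # shape: (1, seq_len, vocab_size)

# Find every [MASK] position and take the most likely token for each one.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))   # one prediction per [MASK]
```

Note that each position is predicted independently given the context, which is why quality stays similar with one or several masks.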
Third,
If no, do you have any suggestions on how to pre-process and train this kind of data?
So the answer is yes, but if you really want to mask elements you choose (and not randomly as in the paper), you should adapt the masking once the data has been tokenized, because the number of masked tokens will be greater than (or equal to) the number of masks you set in the word space (as in the example I gave you: 1 word is not equal to 1 token, so 1 masked word will become 1 or more [MASK] tokens). But honestly, the labelling process would be so heavy that I recommend you either increase the 15% masking probability or add a step that also masks the 1 or 2 tokens following each masked token (or something like that).
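If you do go the custom route, a sketch of the preprocessing could look like the following (assumptions: Hugging Face transformers, bert-base-uncased, and a hypothetical helper mask_answer_span that is not part of any library). It masks every token of the chosen answer span after tokenization and sets the labels to -100 everywhere else, which is the convention BertForMaskedLM uses to ignore positions in the loss:

```python
# Sketch of custom span masking for fine-tuning.
# Assumptions: Hugging Face `transformers`, `bert-base-uncased`;
# `mask_answer_span` is a hypothetical helper written for this example.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_answer_span(context: str, answer: str):
    """Tokenize `context`, mask every token of `answer`, and build MLM labels."""
    enc = tokenizer(context, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

    labels = torch.full_like(input_ids, -100)         # -100 = ignored by the MLM loss
    n = len(answer_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n].tolist() == answer_ids:
            labels[i:i + n] = input_ids[i:i + n]      # keep original tokens as targets
            input_ids[i:i + n] = tokenizer.mask_token_id
            break

    enc["input_ids"] = input_ids.unsqueeze(0)
    return enc, labels.unsqueeze(0)

enc, labels = mask_answer_span("rainy days lead to wet weather", "wet weather")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# Fine-tuning step (sketch): loss = BertForMaskedLM(...)(**enc, labels=labels).loss
```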
Upvotes: 3