Reputation: 159
In https://huggingface.co/learn/nlp-course/chapter7/6#preparing-the-dataset, there is
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
The tutorial uses a pretrained GPT-2 model and its tokenizer to build a dataset for a causal language modeling pretraining task.
My concern with the lines above is that the padding token is set to the EOS token. As a result, even the genuine EOS tokens will be ignored by the model during training, since they will be treated as padding tokens too.
That would prevent my model from learning to output an EOS token when its generation is finished.
How come this is in the tutorial, and is it actually a correct way to do it?
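For example, here is a minimal sketch of what I mean (my own illustration, not the tutorial's code): the collator turns every position holding the PAD id, which is now also the EOS id, into -100 in the labels, so the real EOS at the end of each sequence is never a training target.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Two sequences of different lengths, each with a genuine EOS appended.
examples = [tokenizer("hello world"), tokenizer("hi")]
for ex in examples:
    ex["input_ids"].append(tokenizer.eos_token_id)
    ex["attention_mask"].append(1)

batch = collator(examples)
print(batch["labels"])
# Every position whose id equals eos_token_id (padding AND the real EOS)
# comes out as -100, i.e. it is ignored by the loss.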
Upvotes: 1
Views: 7753
Reputation: 158
The question is from a while ago, but I just want to clarify something on top of the accepted answer.
The tutorial is about pretraining, where the texts are packed (multiple samples are concatenated into one training sample up to max_seq_length), so there are no padding tokens involved. The accepted answer is correct in this case; a rough sketch of packing is given below.
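This is roughly what packing means here (a sketch, not the tutorial's exact code; block_size stands in for max_seq_length):

def pack(tokenized_docs, eos_id, block_size=1024):
    # Concatenate tokenized documents (with EOS between them) into one long
    # stream, then cut it into fixed-size blocks; every block is completely
    # full, so no padding token is ever needed.
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids + [eos_id])
    n_blocks = len(stream) // block_size  # drop the incomplete tail block
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]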
However, since the major use case for padding is instruction tuning / batched inference, it is important that the model learns the EOS token so that it knows when to stop.
If you are not using Hugging Face's Trainer (e.g. SFTTrainer) pipeline, it is easy to modify the logic yourself; if you are, then you can customize the data_collator's torch_call() function, specifically this part:
if self.mlm:
    batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
        batch["input_ids"], special_tokens_mask=special_tokens_mask
    )
else:
    labels = batch["input_ids"].clone()
    if self.tokenizer.pad_token_id is not None:
        labels[labels == self.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
Upvotes: 1
Reputation: 122052
Ignoring the EOS symbol when training a normal language model is okay. So padding the sequence with EOS instead of a dedicated PAD symbol is okay too.
When using DataCollatorForLanguageModeling(tokenizer, mlm=False), the "masked language modeling" mode is off and we are doing causal language modeling, i.e. predicting the next word given the previous ones. Consider this:
['this', 'is', 'a', 'foobar', '.', 'EOS']
Now we pad the sequence until it is 10 tokens long:
['this', 'is', 'a', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS']
When the model is trained as a causal language model, it predicts the next word given the previous ones, i.e.
>>> predict(next_token, given=["BOS"])
'this'
>>> predict(next_token, given=["BOS", "this"])
'is'
...
>>> predict(next_token, given=["BOS", "this", "is", "a", "foobar", "."])
'EOS'
In the most common inference routines, the model will stop once the first EOS is predicted, or once all beams in the search have produced their first EOS.
During training, the model will learn:
ground_truth = [
'this', 'is', 'a', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',
]
ground_prediction = [
'this', 'is', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',
]
And when you compute the perplexity, all the PAD symbols are ignored; so in this case, by treating the EOS as PAD, you are essentially telling the model that even the first EOS does not count when computing the perplexity.
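Concretely, that masking happens in the loss function: Hugging Face causal LM models pass the labels to a cross-entropy loss with ignore_index=-100, so any position labeled -100 (here, every EOS-as-PAD position) contributes nothing and produces no gradient to ever emit EOS. A rough sketch with made-up tensors:

import torch
import torch.nn.functional as F

vocab_size = 50257                               # GPT-2 vocabulary size
logits = torch.randn(1, 10, vocab_size)          # fake model outputs for a 10-token sequence
labels = torch.randint(0, vocab_size, (1, 10))   # fake targets
labels[0, 5:] = -100                             # the EOS-as-PAD tail is masked out

# Shift so that position t predicts token t+1, as in causal LM training.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

# ignore_index=-100: the masked positions contribute nothing to the loss.
loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
    ignore_index=-100,
)
print(loss)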
A: It depends on your task and what you want the EOS to mean. For most natural language, we have punctuation before the EOS, so EOS vs. PAD doesn't really matter. For programming languages, we have '\n' and ';' or some end-of-sequence operator, so the EOS isn't strictly necessary either.
A: Actually that's a good question: we pad so that the dot products in the transformer attention can be "easily" computed.
There are many cases where sequences can be packed efficiently instead of padded, e.g. for RNNs with https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html (IIRC, not for transformer architectures though).
But I don't know how much of that has made it into the PyTorch/JAX libraries underlying "efficient" transformers, which would let us avoid pre-padding inputs. In my experience with Hugging Face PyTorch models, if you don't pad the inputs, the model will most probably complain when you do a forward pass =(
If only someone would fix that mathematically. Maybe someone has tried, but it isn't yet common enough to be used by most pretrained transformer models.
Upvotes: 7