During the generation phase in HuggingFace's code (https://github.com/huggingface/transformers/blob/master/src/transformers/generation_utils.py#L88-L100), they pass in a decoder_start_token_id, and I'm not sure why they need it. In the BART config (https://huggingface.co/facebook/bart-base/blob/main/config.json), the decoder_start_token_id is actually 2, which is the end-of-sentence token </s>.
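The config values can be checked directly (a minimal check; BartConfig.from_pretrained and the attribute names below are the standard library API, matching the linked config.json):

from transformers import BartConfig

config = BartConfig.from_pretrained('facebook/bart-base')
print(config.decoder_start_token_id)  # 2
print(config.eos_token_id)            # 2, i.e. </s>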
I tried a simple example:
from transformers import BartForConditionalGeneration, BartTokenizer
import torch

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# ids for "<s> He go to school . </s>" (0 = <s>, 2 = </s>)
input_ids = torch.LongTensor([[0, 894, 213, 7, 334, 479, 2]])
res = model.generate(input_ids, num_beams=1, max_length=100)
print(res)

preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True).strip() for g in res]
print(preds)
The results I obtained:
tensor([[ 2, 0, 894, 213, 7, 334, 479, 2]])
['He go to school.']
Though this does not affect the final decoded text (skip_special_tokens=True drops it anyway), it seems weird to me that the first token we generate is actually 2 (</s>).
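One way to make this explicit is to map the generated ids back to token strings (convert_ids_to_tokens is a standard tokenizer method):

print(tokenizer.convert_ids_to_tokens(res[0].tolist()))
# the first two entries are '</s>' and '<s>' (ids 2 and 0), followed by the sentence tokens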
Upvotes: 6
Views: 8334
You can see in the code for encoder-decoder models that the input tokens for the decoder are right-shifted from the original (see the function shift_tokens_right). This means that the first token the decoder has to predict is always BOS (beginning of sentence). You can check that this is the case in your example.
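For intuition, here is a minimal sketch of what that shifting amounts to (not the library's exact implementation, which also deals with padding): the decoder inputs are the target ids shifted one position to the right, with decoder_start_token_id placed in front.

import torch

def shift_tokens_right(input_ids: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    # decoder inputs = targets shifted right by one, start token in position 0
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    return shifted

labels = torch.LongTensor([[0, 894, 213, 7, 334, 479, 2]])  # <s> He go to school . </s>
print(shift_tokens_right(labels, decoder_start_token_id=2))
# tensor([[  2,   0, 894, 213,   7, 334, 479]])
# -> given </s> as its first input, the decoder's first target is 0 (<s>, i.e. BOS)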
For the decoder to understand this, we must choose a first token that is always followed by BOS, so which could it be? BOS itself? Obviously not, because it must be followed by regular tokens. The padding token? Also not a good choice, because padding is followed either by more padding or by EOS (end of sentence). So what about EOS? That makes sense, because in the training set EOS is never followed by anything, so there is no next token to come into conflict with BOS. And besides, isn't it natural that the beginning of one sentence follows the end of another?
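You can verify this with the model from the question: feed </s> as the only decoder input and look at the most likely next token (this assumes a recent transformers version where the forward pass returns an output object with a .logits attribute):

import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')
encoder_input = torch.LongTensor([[0, 894, 213, 7, 334, 479, 2]])  # <s> He go to school . </s>
decoder_input = torch.LongTensor([[2]])                            # start the decoder with </s>

with torch.no_grad():
    logits = model(input_ids=encoder_input, decoder_input_ids=decoder_input).logits

print(logits[0, -1].argmax().item())  # expected: 0, i.e. <s> (BOS)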
Upvotes: 6