Reputation: 11
I am very confused about why the padding side matters in a decoder-only model. If we give the model the attention mask, then no matter whether the padding is on the left or the right, the masked scaled dot-product scores at the padding positions become large negative numbers. Doesn't that mean the attention weights over the actual tokens are the same regardless of the padding side?
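To make my reasoning concrete, here is a tiny PyTorch sketch of the masked scaled dot product I mean (just the padding mask, no positional encodings or causal mask; all names are illustrative):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
real = torch.randn(3, d)   # embeddings of 3 real tokens
pad = torch.zeros(2, d)    # embeddings of 2 padding tokens

def attn_weights(x, mask):
    # scaled dot-product scores, with a big negative number on padded keys
    scores = x @ x.T / d ** 0.5
    scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
    return F.softmax(scores, dim=-1)

w_right = attn_weights(torch.cat([real, pad]), torch.tensor([1, 1, 1, 0, 0]))  # right padding
w_left = attn_weights(torch.cat([pad, real]), torch.tensor([0, 0, 1, 1, 1]))   # left padding

# the weights among the real tokens come out identical either way
print(torch.allclose(w_right[:3, :3], w_left[2:, 2:]))  # True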
I would like a mathematical explanation, please.
Upvotes: 0
Views: 547
Reputation: 11
The padding side doesn't matter theoretically. Even an arbitrary 0/1 attention mask (like [[1,1,1,0,1,1]]) is fine, as long as the modeling code can deal with it.
But padding on one fixed side during training/evaluation brings convenience. For example, if the input is left-padded, you can always retrieve the next-token logits with batch_new_token_logits = output_logits[:, -1], instead of using gather or per-row indexing to get the same result. If the input is right-padded, say attention_mask = [[1,1,1,0,0], [1,1,1,1,1], [1,1,1,1,0]], you need something like batch_new_token_logits = output_logits[torch.arange(batch_size), attention_mask.argmin(-1) - 1]. The latter takes more time and is a bit harder to understand.
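A minimal runnable sketch of the difference, assuming output_logits has shape (batch, seq_len, vocab) as returned by a Hugging Face-style decoder forward pass (tensor contents and variable names here are illustrative):

import torch

batch, seq_len, vocab = 3, 5, 10
output_logits = torch.randn(batch, seq_len, vocab)   # stand-in for model(...).logits

# Left padding: the last real token of every row is at position -1.
next_token_logits = output_logits[:, -1]             # shape (batch, vocab)

# Right padding: the last real token sits at a different index per row,
# so per-row indexing (or gather) is needed.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0]])
last_real = attention_mask.argmin(-1) - 1            # index of the last real token per row
next_token_logits = output_logits[torch.arange(batch), last_real]  # shape (batch, vocab)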
Upvotes: 0