Joe Black

Reputation: 653

Question about tokens used in Transformer decoder attention layers during Inference

I was looking at the shapes used in the decoder (both the self-attention and enc-dec-attention blocks) and understand that the decoder runs differently during training versus inference, based on this link and the original Attention paper.

During inference, the decoder uses all previous tokens generated up to that time step (say the kth step), as shown in the diagram below and explained at this link.

Issue:

However, when I look at the actual shapes of the Q/K/V projections in the decoder self-attention, and at the decoder self-attention output being fed into the enc-dec-attention's Q matrix, I see only 1 token from the output being used.

I'm very confused about how the shapes of all the matrices in the decoder's self-attention and enc-dec-attention can match up with a variable-length input to the decoder during inference. I looked at several online materials but couldn't find an answer. I see only the batched GEMMs in the decoder's self-attention (not the enc-dec-attention) using variable shapes that grow with the number of previous steps k, while all other GEMMs are fixed size.
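To make the shapes I'm describing concrete, here is a rough sketch of what I think is happening at decoding step k (an illustration only; d_model, k and src_len are placeholder sizes, and batch and heads are ignored):

```python
import torch

d_model, k, src_len = 512, 7, 20   # placeholder sizes for illustration

# Decoder self-attention at step k:
q     = torch.randn(1, d_model)          # only the newest token is projected to Q
k_all = torch.randn(k, d_model)          # keys for all k tokens generated so far (variable)
v_all = torch.randn(k, d_model)          # values for all k tokens so far (variable)
scores = q @ k_all.T                     # (1, k)  <- the variable-shape batched GEMM
self_out = torch.softmax(scores, dim=-1) @ v_all   # (1, d_model) -- same shape as the query

# Enc-dec attention: Q comes from the one-token self-attention output,
# while K and V come from the encoder output and stay fixed for the whole generation.
enc_out = torch.randn(src_len, d_model)
cross   = torch.softmax(self_out @ enc_out.T, dim=-1) @ enc_out    # (1, d_model)
```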

Another diagram that shows self-attention and enc-dec-attention within decoder:

[image: decoder block showing the self-attention and enc-dec-attention layers]

Upvotes: 1

Views: 1728

Answers (1)

Arij Aladel

Reputation: 555

  1. This is possible because the transformer keeps previous key-value pairs, which are used only at the inference stage. The newly generated token is passed through the embedding layer, projected to a key and a value, and these are appended to the cached keys and values; the new token's query then attends over this full cache, so you get attention to the current token and all previously generated tokens. The cache is then updated again for the next generation step. To understand it, it is best to track the inference process token by token, which I did before (see the code sketch after this list). But wait: what about the very first generation step, when we have fed in nothing yet? The previous keys-values are also none! So the first token is generated depending entirely on the encoded input from the encoder. To make it easier to picture, I have drawn a small diagram; I hope it helps.
  2. As for shapes, the input shape for the decoder is fixed: as we see from the diagram above, it is always one token (the last token generated by the decoder!). Note also that the shape of the attention output is always the same as the query shape, i.e. one token. To make this easier to understand, I will use T5 from Hugging Face as an example: this condition shows what I am talking about, where the keys and values are first projected to generate the first token. For greedy search you can see here how, at each generation step, they call the whole transformer model to generate just the next token; here they concatenate the newly generated token to the previously generated tokens to check the stop condition, which is either generating the stop token or reaching the maximum length of generated tokens (a rough sketch of such a loop is at the end of this answer).
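Here is a minimal sketch of this key-value caching in the decoder self-attention (my own illustration, not actual T5/Hugging Face code; names like past_k and past_v are made up, and batch and heads are ignored):

```python
import torch

def cached_self_attention(x_new, w_q, w_k, w_v, past_k=None, past_v=None):
    """x_new is the embedding of the single newly generated token, shape (1, d_model)."""
    q     = x_new @ w_q                   # (1, d_model) -- query for the new token only
    k_new = x_new @ w_k                   # (1, d_model)
    v_new = x_new @ w_v                   # (1, d_model)

    # Append the new key/value to the cache of all previous steps.
    # At the very first step there is no cache yet (past_k/past_v are None).
    k = k_new if past_k is None else torch.cat([past_k, k_new], dim=0)   # (k_steps, d_model)
    v = v_new if past_v is None else torch.cat([past_v, v_new], dim=0)

    # Only K and V grow with the step count, so only these GEMMs have variable shapes.
    scores  = (q @ k.T) / (q.shape[-1] ** 0.5)   # (1, k_steps)
    weights = torch.softmax(scores, dim=-1)
    out     = weights @ v                        # (1, d_model) -- same shape as the query
    return out, (k, v)                           # return the updated cache for the next step
```

Because the output keeps the query's one-token shape, the Q matrix of the following enc-dec attention also sees only one token, which matches what you observed.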

Hope this answers your question; it all comes from my own earlier attempt to understand the inference process.
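And here is a rough sketch of the greedy-search loop in the same spirit (again my own simplification, not the actual Hugging Face generation code; decoder_step is a made-up helper standing in for one call of the full model with its cache):

```python
import torch

def greedy_decode(decoder_step, encoder_output, bos_id, eos_id, max_len=50):
    # decoder_step(last_token, encoder_output, cache) is assumed to return
    # the logits for the next token and the updated key-value cache.
    generated = torch.tensor([[bos_id]])              # (batch=1, seq=1)
    cache = None
    for _ in range(max_len):
        # The model is called with only the last generated token; all earlier
        # tokens are represented by the cached keys and values.
        logits, cache = decoder_step(generated[:, -1:], encoder_output, cache)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick, shape (1, 1)
        generated = torch.cat([generated, next_token], dim=1)     # concat to check the stop condition
        if next_token.item() == eos_id:                           # stop: EOS token or max length
            break
    return generated
```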

Upvotes: 0
