bellerb

Reputation: 157

What memory does a decoder-only Transformer use?

I've been reading a lot about transformers and self-attention, and have seen that BERT and GPT-2 are newer variants that use only an encoder (BERT) or only a decoder (GPT-2). I've been trying to build a decoder-only model for next-sequence prediction, but I'm confused by one thing. I'm using PyTorch, and I've looked at their Seq2Seq tutorial and then into the TransformerDecoder block, which is made up of TransformerDecoderLayer modules. My confusion comes from the memory these layers also need to be passed. The documentation says memory is the output of the last layer of the encoder block, which makes sense for a Seq2Seq model, but I want to make a decoder-only model. So my question is: what do you pass a decoder-only model like GPT-2 for memory if you do not have an encoder?
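To illustrate, here is roughly what I'm trying (a minimal sketch with made-up dimensions, not my actual model):

```python
import torch
import torch.nn as nn

# TransformerDecoder's forward requires a memory tensor,
# which is normally the encoder's final hidden states.
layer = nn.TransformerDecoderLayer(d_model=64, nhead=4)
decoder = nn.TransformerDecoder(layer, num_layers=2)

tgt = torch.rand(10, 1, 64)     # (seq_len, batch, d_model)
memory = torch.rand(10, 1, 64)  # what goes here with no encoder?
out = decoder(tgt, memory)
```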

Upvotes: 5

Views: 3858

Answers (1)

bellerb

Reputation: 157

After further investigation I believe I can answer this myself. A decoder-only transformer doesn't use any memory, because there is no encoder-decoder cross-attention in it like there is in an encoder-decoder transformer. A decoder-only transformer looks a lot like an encoder-only transformer, except it uses a masked self-attention layer instead of a plain self-attention layer. To achieve this you pass a square subsequent mask (upper triangle) so that the model cannot look forward, which gives you a decoder-only model like the one found in GPT-2/GPT-3.
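Here is a minimal sketch of that idea in PyTorch (the dimensions and layer counts below are illustrative, not GPT-2's): since there is no cross-attention, nn.TransformerEncoder plus a causal mask behaves as a decoder-only stack and takes no memory argument at all.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not GPT-2's actual hyperparameters.
d_model, nhead, num_layers, seq_len = 64, 4, 2, 10

# Without encoder-decoder cross-attention, a decoder-only block is
# structurally an encoder layer; the causal mask makes it a decoder.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
decoder_only = nn.TransformerEncoder(layer, num_layers=num_layers)

# Square subsequent mask: -inf strictly above the diagonal, so
# position i can only attend to positions <= i (no looking forward).
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

x = torch.rand(seq_len, 1, d_model)  # (seq_len, batch, d_model)
out = decoder_only(x, mask=mask)     # note: no memory argument
print(out.shape)                     # torch.Size([10, 1, 64])
```

If you'd rather keep nn.TransformerDecoderLayer, the equivalent trick is to drop its cross-attention by masking, but reusing the encoder stack with a causal mask is the simpler route.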

Upvotes: 1
