bellerb

Reputation: 157

What memory does a decoder-only Transformer use?

I've been reading a lot about transformers and self-attention, and have seen that BERT and GPT-2 are newer variants that use only an encoder (BERT) or only a decoder (GPT-2). I've been trying to build a decoder-only model for next-sequence prediction, but I'm confused by one thing. I'm using PyTorch, and I've looked at their Seq2Seq tutorial and then into the TransformerDecoder block, which is made up of TransformerDecoderLayer modules. My confusion comes from the memory these layers also need to be passed. The documentation says memory is the output of the last layer of the encoder block, which makes sense for a Seq2Seq model, but I want to make a decoder-only model. So my question is: what do you pass a decoder-only model like GPT-2 for memory if you do not have an encoder?
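To illustrate, here is roughly what I'm trying (a minimal sketch with made-up dimensions, not my actual model):

```python
import torch
import torch.nn as nn

# TransformerDecoder's forward requires a memory tensor,
# which is normally the encoder's final hidden states.
layer = nn.TransformerDecoderLayer(d_model=64, nhead=4)
decoder = nn.TransformerDecoder(layer, num_layers=2)

tgt = torch.rand(10, 1, 64)     # (seq_len, batch, d_model)
memory = torch.rand(10, 1, 64)  # what goes here with no encoder?
out = decoder(tgt, memory)
```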

Upvotes: 5

Views: 3858

Answers (1)

bellerb

Reputation: 157

After further investigation I believe I can answer this myself. A decoder-only transformer doesn't use any memory, because there is no encoder-decoder cross-attention in it like there is in an encoder-decoder transformer. A decoder-only transformer looks a lot like an encoder-only transformer, except it uses a masked self-attention layer instead of a plain self-attention layer. To achieve this you pass a square subsequent mask (upper triangle) so that the model cannot look forward, which gives you a decoder-only model like the one found in GPT-2/GPT-3.
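Here is a minimal sketch of that idea in PyTorch (the dimensions and layer counts below are illustrative, not GPT-2's): since there is no cross-attention, nn.TransformerEncoder plus a causal mask behaves as a decoder-only stack and takes no memory argument at all.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not GPT-2's actual hyperparameters.
d_model, nhead, num_layers, seq_len = 64, 4, 2, 10

# Without encoder-decoder cross-attention, a decoder-only block is
# structurally an encoder layer; the causal mask makes it a decoder.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
decoder_only = nn.TransformerEncoder(layer, num_layers=num_layers)

# Square subsequent mask: -inf strictly above the diagonal, so
# position i can only attend to positions <= i (no looking forward).
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

x = torch.rand(seq_len, 1, d_model)  # (seq_len, batch, d_model)
out = decoder_only(x, mask=mask)     # note: no memory argument
print(out.shape)                     # torch.Size([10, 1, 64])
```

If you'd rather keep nn.TransformerDecoderLayer, the equivalent trick is to drop its cross-attention by masking, but reusing the encoder stack with a causal mask is the simpler route.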

Upvotes: 1
