BBB

Reputation: 23

Custom Transformer Decoder Layer Implementation (EECS598)

I am working through the assignments for EECS598 (I am not a student), the Deep Learning for Computer Vision course taught by Justin Johnson, and my implementation of the decoder layer in Assignment 5, which can be found here, does not seem to be working. I have looked at other sources and I am fairly sure my implementation is right, so I am becoming more convinced every day that the expected answer provided in the notebook is wrong.

My current implementation of DecoderBlock is below. It relies on other layers that I also implemented myself, so the mistake could be in one of those, but all of the other layers match the expected answers in the notebook, so I doubt that is the problem.

Can someone please take a look at the notebook and let me know whether they are able to reproduce its expected answer? Thank you!

import torch.nn as nn
from torch import Tensor

# MultiHeadAttention, LayerNormalization, and FeedForwardBlock used below are
# the layers implemented earlier in the A5 notebook.


class DecoderBlock(nn.Module):
    def __init__(
        self, num_heads: int, emb_dim: int, feedforward_dim: int, dropout: float
    ):
        super().__init__()
        if emb_dim % num_heads != 0:
            raise ValueError(
                f"The value emb_dim = {emb_dim} is not divisible by "
                f"num_heads = {num_heads}. Please select an appropriate value."
            )

        """
        The function implements the DecoderBlock for the Transformer model. In the 
        class we learned about encoder only model that can be used for tasks like 
        sequence classification but for more complicated tasks like sequence to 
        sequence we need a decoder network that can transformt the output of the 
        encoder to a target sequence. This kind of architecture is important in 
        tasks like language translation where we have a sequence as input and a 
        sequence as output. 
        
        As shown in the Figure 1 of the paper attention is all you need
        https://arxiv.org/pdf/1706.03762.pdf, the encoder consists of 5 components:   
        
        1. Masked MultiHead Attention
        2. MultiHead Attention
        3. FeedForward layer
        4. Residual connections after MultiHead Attention and feedforward layer
        5. LayerNorm        
        
        The Masked MultiHead Attention takes the target, masks it as per the
        function get_subsequent_mask, and then applies the MultiHead Attention
        layer. A second MultiHead Attention block then takes the encoder
        output together with the output of the Masked MultiHead Attention
        layer, producing an output that lets the model create interactions
        between the input and the target. Because this block mixes the input
        and the target, it is also called cross-attention.

        The architecture is as follows:
        
        inp - masked_multi_head_attention - out1 - layer_norm(inp + out1) - \
        dropout - (out2 and enc_out) -  multi_head_attention - out3 - \
        layer_norm(out3 + out2) - dropout - out4 - feed_forward - out5 - \
        layer_norm(out5 + out4) - dropout - out
        
        Here, out1, out2, out3, out4, out5 are the corresponding outputs for the 
        layers, enc_out is the encoder output and we add these outputs to their  
        respective inputs for implementing residual connections.
        
        args:
            num_heads: int value representing number of heads

            emb_dim: int value representing embedding dimension

            feedforward_dim: int representing the hidden dimension of the
                feed forward block

            dropout: float representing the dropout value
        """
        self.attention_self = None
        self.attention_cross = None
        self.feed_forward = None
        self.layer_norm1 = None
        self.layer_norm2 = None
        self.layer_norm3 = None
        self.dropout = None
        ##########################################################################
        # TODO: Initialize the following layers:                                 #
        # 1. Two MultiheadAttention layers with num_heads number of heads, emb_dim
        #     as the embedding dimension. As done in Encoder, you should be able to
        #     figure out the output dimension of both the MultiHeadAttention.    #
        # 2. One FeedForward block that takes in emb_dim as input dimension and  #
        #   feedforward_dim as hidden layers                                     #
        # 3. LayerNormalization layers after each of the block                   #
        # 4. Dropout after each of the block                                     #
        ##########################################################################

        # Replace "pass" statement with your code
        self.attention_self = MultiHeadAttention(num_heads=num_heads, dim_in=emb_dim, dim_out=emb_dim//num_heads)
        self.attention_cross = MultiHeadAttention(num_heads=num_heads, dim_in=emb_dim, dim_out=emb_dim//num_heads)
        self.layer_norm1 = LayerNormalization(emb_dim=emb_dim)
        self.layer_norm2 = LayerNormalization(emb_dim=emb_dim)
        self.layer_norm3 = LayerNormalization(emb_dim=emb_dim)
        self.feed_forward = FeedForwardBlock(inp_dim=emb_dim, hidden_dim_feedforward=feedforward_dim)
        self.dropout = nn.Dropout(p=dropout)
        ##########################################################################
        #               END OF YOUR CODE                                         #
        ##########################################################################

    def forward(
        self, dec_inp: Tensor, enc_inp: Tensor, mask: Tensor = None
    ) -> Tensor:

        """
        args:
            dec_inp: a Tensor of shape (N, K, M)
            enc_inp: a Tensor of shape (N, K, M)
            mask: a Tensor of shape (N, K, K)

        This function handles the forward pass of the decoder block. It takes
        enc_inp, which is the encoder output, and dec_inp, which is the target
        sequence shifted by one position during training, or just the initial
        "BOS" token during inference.
        """
        y = None
        ##########################################################################
        # TODO: Using the layers initialized in the init function, implement the #
        # forward pass of the decoder block. Pass the dec_inp to the             #
        # self.attention_self layer. This layer is responsible for the self-  #
        # interaction of the decoder input. You should follow Figure 1 in the #
        # Attention Is All You Need paper to implement the rest of the forward#
        # pass. Don't forget to apply the residual connections for different layers.
        ##########################################################################
        # Replace "pass" statement with your code
        out = self.attention_self(dec_inp, dec_inp, dec_inp, mask)
        out = self.dropout(self.layer_norm1(dec_inp + out))
        out2 = self.attention_cross(out, enc_inp, enc_inp)
        out = self.dropout(self.layer_norm2(out2 + out))
        out2 = self.feed_forward(out)
        y = self.dropout(self.layer_norm3(out + out2))
        ##########################################################################
        #               END OF YOUR CODE                                         #
        ##########################################################################
        return y
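
For anyone without the full notebook handy, here is a minimal, shape-only smoke test I run on this block. It is only a sketch: it assumes MultiHeadAttention, LayerNormalization, and FeedForwardBlock from earlier in the A5 notebook are already defined, and it uses a torch.tril lower-triangular mask as a stand-in for get_subsequent_mask, whose exact dtype/polarity convention may differ. It only confirms the block runs and preserves the (N, K, M) shape; it does not check the notebook's expected values.

import torch

torch.manual_seed(0)

N, K, M = 2, 7, 32              # batch size, target length, embedding dim
num_heads, ff_dim = 4, 64       # hypothetical test values

block = DecoderBlock(num_heads=num_heads, emb_dim=M, feedforward_dim=ff_dim, dropout=0.1)
block.eval()                    # disable dropout so the run is deterministic

dec_inp = torch.randn(N, K, M)  # shifted target sequence
enc_out = torch.randn(N, K, M)  # encoder output

# Stand-in for get_subsequent_mask: position i may attend to positions <= i.
# The notebook's mask may use the opposite polarity or a different dtype.
mask = torch.tril(torch.ones(K, K)).bool().unsqueeze(0).expand(N, K, K)

with torch.no_grad():
    y = block(dec_inp, enc_out, mask)

print(y.shape)                  # expected: torch.Size([2, 7, 32])

If this shape check passes for you but the values still disagree with the notebook, the mismatch is more likely in the masking convention or the LayerNorm/dropout ordering than in the wiring itself.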

Upvotes: 0

Views: 147

Answers (0)
