Reputation: 23
I am working through the assignments for EECS598 - Deep Learning for Computer Vision, taught by Justin Johnson (I am not a student in the course), and my implementation of the Decoder Layer in Assignment 5, which can be found here, does not seem to produce the expected output. I have looked at other sources and I am fairly sure my implementation is right, so I am becoming more convinced every day that the expected answer provided in the notebook is wrong.
My current implementation of DecoderBlock is below. It depends on other layers (MultiHeadAttention, LayerNormalization, FeedForwardBlock) that I also implemented, so the problem could be there instead, but all of those layers match the expected answers in the notebook, so I doubt it.
Can someone please take a look at the notebook and let me know whether they are able to reproduce the expected answer? Thank you!
import torch.nn as nn
from torch import Tensor

# MultiHeadAttention, LayerNormalization and FeedForwardBlock are my own
# implementations from earlier cells in the notebook.


class DecoderBlock(nn.Module):
    def __init__(
        self, num_heads: int, emb_dim: int, feedforward_dim: int, dropout: float
    ):
        super().__init__()
        if emb_dim % num_heads != 0:
            raise ValueError(
                f"""The value emb_dim = {emb_dim} is not divisible
                by num_heads = {num_heads}. Please select an
                appropriate value."""
            )
"""
The function implements the DecoderBlock for the Transformer model. In the
class we learned about encoder only model that can be used for tasks like
sequence classification but for more complicated tasks like sequence to
sequence we need a decoder network that can transformt the output of the
encoder to a target sequence. This kind of architecture is important in
tasks like language translation where we have a sequence as input and a
sequence as output.
As shown in the Figure 1 of the paper attention is all you need
https://arxiv.org/pdf/1706.03762.pdf, the encoder consists of 5 components:
1. Masked MultiHead Attention
2. MultiHead Attention
3. FeedForward layer
4. Residual connections after MultiHead Attention and feedforward layer
5. LayerNorm
The Masked MultiHead Attention takes the target, masks it as per the
function get_subsequent_mask and then gives the output as per the MultiHead
Attention layer. Further, another Multihead Attention block here takes the
encoder output and the output from Masked Multihead Attention layer giving
the output that helps the model create interaction between input and
targets. As this block helps in interation of the input and target, it
is also sometimes called the cross attention.
The architecture is as follows:
inp - masked_multi_head_attention - out1 - layer_norm(inp + out1) - \
dropout - (out2 and enc_out) - multi_head_attention - out3 - \
layer_norm(out3 + out2) - dropout - out4 - feed_forward - out5 - \
layer_norm(out5 + out4) - dropout - out
Here, out1, out2, out3, out4, out5 are the corresponding outputs for the
layers, enc_out is the encoder output and we add these outputs to their
respective inputs for implementing residual connections.
args:
num_heads: int value representing number of heads
emb_dim: int value representing embedding dimension
feedforward_dim: int representing hidden layers in the feed forward
model
dropout: float representing the dropout value
"""
        self.attention_self = None
        self.attention_cross = None
        self.feed_forward = None
        self.norm1 = None
        self.norm2 = None
        self.norm3 = None
        self.dropout = None
        ######################################################################
        # TODO: Initialize the following layers:                             #
        # 1. Two MultiHeadAttention layers with num_heads heads and emb_dim  #
        #    as the embedding dimension. As done in the Encoder, you should  #
        #    be able to figure out the output dimension of both layers.      #
        # 2. One FeedForward block that takes emb_dim as the input dimension #
        #    and feedforward_dim as the hidden dimension.                    #
        # 3. A LayerNormalization layer after each of the blocks.            #
        # 4. Dropout after each of the blocks.                               #
        ######################################################################
        # Replace "pass" statement with your code
        self.attention_self = MultiHeadAttention(
            num_heads=num_heads, dim_in=emb_dim, dim_out=emb_dim // num_heads
        )
        self.attention_cross = MultiHeadAttention(
            num_heads=num_heads, dim_in=emb_dim, dim_out=emb_dim // num_heads
        )
        self.layer_norm1 = LayerNormalization(emb_dim=emb_dim)
        self.layer_norm2 = LayerNormalization(emb_dim=emb_dim)
        self.layer_norm3 = LayerNormalization(emb_dim=emb_dim)
        self.feed_forward = FeedForwardBlock(
            inp_dim=emb_dim, hidden_dim_feedforward=feedforward_dim
        )
        self.dropout = nn.Dropout(p=dropout)
        ######################################################################
        # END OF YOUR CODE                                                   #
        ######################################################################
    def forward(
        self, dec_inp: Tensor, enc_inp: Tensor, mask: Tensor = None
    ) -> Tensor:
        """
        args:
            dec_inp: a Tensor of shape (N, K, M)
            enc_inp: a Tensor of shape (N, K, M)
            mask: a Tensor of shape (N, K, K)
        This function handles the forward pass of the Decoder block. It takes
        as input enc_inp, which is the encoder output, and dec_inp, which is
        the target sequence shifted by one during training, or an initial
        "BOS" token during inference.
        """
        y = None
        ######################################################################
        # TODO: Using the layers initialized in the init function, implement #
        # the forward pass of the decoder block. Pass dec_inp to the         #
        # self.attention_self layer; this layer is responsible for the self- #
        # interaction of the decoder input. Follow Figure 1 in the Attention #
        # Is All You Need paper to implement the rest of the forward pass.   #
        # Don't forget to apply the residual connections for the different   #
        # layers.                                                            #
        ######################################################################
        # Replace "pass" statement with your code
        # Masked self-attention over the decoder input, followed by the
        # residual connection, layer norm and dropout.
        out = self.attention_self(dec_inp, dec_inp, dec_inp, mask)
        out = self.dropout(self.layer_norm1(dec_inp + out))
        # Cross-attention: queries come from the decoder, keys/values from
        # the encoder output, again followed by residual, norm and dropout.
        out2 = self.attention_cross(out, enc_inp, enc_inp)
        out = self.dropout(self.layer_norm2(out2 + out))
        # Position-wise feed-forward block, then residual, norm and dropout.
        out2 = self.feed_forward(out)
        y = self.dropout(self.layer_norm3(out + out2))
        ######################################################################
        # END OF YOUR CODE                                                   #
        ######################################################################
        return y
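For context, the mask I pass in is the (N, K, K) tensor returned by the assignment's get_subsequent_mask helper, which the docstring above refers to. I have not reproduced that helper here; the sketch below is only my assumption of what it effectively computes (a lower-triangular "may attend" mask), and the name subsequent_mask, its signature, and the True-means-attend convention are mine rather than the notebook's:

import torch

def subsequent_mask(batch_size: int, seq_len: int) -> torch.Tensor:
    # Lower-triangular (N, K, K) boolean mask: position i may attend to
    # positions <= i. Whether True means "attend" or "block" depends on how
    # your MultiHeadAttention consumes the mask; here True marks the
    # positions that are allowed to be attended to.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    return mask.unsqueeze(0).expand(batch_size, -1, -1)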
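Finally, before comparing against the notebook's stored expected answer, I run a quick sanity check like the one below. It assumes DecoderBlock and its dependencies are already defined in the notebook session, uses arbitrary dimensions, and relies on the subsequent_mask sketch above, so it only confirms that the shapes line up, not that the values match:

import torch

N, K, M = 2, 5, 8  # batch size, sequence length, embedding dim (arbitrary)
block = DecoderBlock(num_heads=2, emb_dim=M, feedforward_dim=16, dropout=0.1)
block.eval()  # disable dropout so repeated runs give the same output

dec_inp = torch.randn(N, K, M)  # stand-in for the shifted target sequence
enc_out = torch.randn(N, K, M)  # stand-in for the encoder output
mask = subsequent_mask(N, K)    # (N, K, K), from the sketch above

out = block(dec_inp, enc_out, mask)
assert out.shape == (N, K, M)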
Upvotes: 0
Views: 147