Reputation: 592
I'm currently trying to compute the Bahdanau attention score function, i.e. score = v^T · tanh(W_decoder · h_decoder + W_encoder · h_encoder).
My question is about the h states of the decoder and the encoder.
In one implementation, I see an encoder state h_encoder with dimensions: [max source len, batch size, hidden size]
and a decoder state h_decoder with dimensions: [#lstm layers, batch size, hidden size]
How can I compute the addition if the outputs of the W projections have to have the same dimensions, as described here: https://blog.floydhub.com/attention-mechanism/#bahdanau-att-step1?
Thanks for the help
Upvotes: 0
Views: 179
Reputation: 11240
In the original Bahdanau paper, the decoder has only a single LSTM layer. There are various ways to deal with multiple layers. A fairly common choice is to put the attention between the layers (which you apparently did not do; see, e.g., a paper by Google). If you use multiple decoder layers the way you describe, you can either use only the last layer (i.e., h_decoder[-1]) or concatenate the per-layer states along the hidden dimension (in PyTorch with torch.cat, in TensorFlow with tf.concat).
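A minimal PyTorch sketch of the two options, assuming the decoder-state shape from the question and illustrative sizes of my own:

```python
import torch

# Illustrative sizes (not from the question); h_decoder follows [num_layers, batch, hidden]
num_layers, batch, hidden = 2, 4, 8
h_decoder = torch.randn(num_layers, batch, hidden)

# Option 1: keep only the last decoder layer -> [batch, hidden]
dec_last = h_decoder[-1]

# Option 2: concatenate the per-layer states along the hidden dimension -> [batch, num_layers * hidden]
dec_cat = torch.cat(tuple(h_decoder), dim=-1)
```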
The matrices W_decoder and W_encoder ensure that both the encoder and decoder states get projected to the same dimension (regardless of what you did with the decoder layers), so you can do the summation.
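For example (a sketch with my own names and made-up sizes, not your implementation), the two projections can be plain linear layers mapping both states to a shared attention dimension:

```python
import torch
import torch.nn as nn

# Illustrative sizes and names (my own, not from the question)
max_src_len, batch, enc_hidden, dec_hidden, attn_dim = 5, 4, 8, 8, 8

h_encoder = torch.randn(max_src_len, batch, enc_hidden)  # encoder outputs
dec_last = torch.randn(batch, dec_hidden)                # e.g. h_decoder[-1]

W_encoder = nn.Linear(enc_hidden, attn_dim, bias=False)  # projects encoder states
W_decoder = nn.Linear(dec_hidden, attn_dim, bias=False)  # projects the decoder state

proj_enc = W_encoder(h_encoder)  # [max_src_len, batch, attn_dim]
proj_dec = W_decoder(dec_last)   # [batch, attn_dim] -- same last dimension, so the two can be summed
```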
The only remaining issue is that the encoder states have the extra max-length dimension. The trick is to add a dimension to the projected decoder state, so the summation gets broadcasted and the projected decoder state gets summed with all the encoder states. In PyTorch, just call unsqueeze on the projected decoder state in the 0-th dimension; in TensorFlow, use expand_dims.
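Putting it together, a hedged PyTorch sketch of the additive score with the broadcasted sum (again with illustrative sizes and my own names):

```python
import torch
import torch.nn as nn

# Illustrative sizes and names (my own, not from the question)
max_src_len, batch, hidden, attn_dim = 5, 4, 8, 8

h_encoder = torch.randn(max_src_len, batch, hidden)  # [max_src_len, batch, hidden]
dec_last = torch.randn(batch, hidden)                # last-layer decoder state

W_encoder = nn.Linear(hidden, attn_dim, bias=False)
W_decoder = nn.Linear(hidden, attn_dim, bias=False)
v = nn.Linear(attn_dim, 1, bias=False)

proj_enc = W_encoder(h_encoder)              # [max_src_len, batch, attn_dim]
proj_dec = W_decoder(dec_last).unsqueeze(0)  # [1, batch, attn_dim], broadcasts over source length

scores = v(torch.tanh(proj_enc + proj_dec)).squeeze(-1)   # [max_src_len, batch]
weights = torch.softmax(scores, dim=0)                     # attention distribution over source positions
context = (weights.unsqueeze(-1) * h_encoder).sum(dim=0)   # [batch, hidden] context vector
```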
Upvotes: 1