mg nt

Reputation: 191

Context vector shape using Bahdanau Attention

I am looking here at the Bahdanau attention class. I noticed that the final shape of the context vector is (batch_size, hidden_size). I am wondering how they got that shape, given that attention_weights has shape (batch_size, 64, 1) and features has shape (batch_size, 64, embedding_dim). They multiplied the two (I believe it is a matrix product) and then summed over axis 1. Where is the hidden size coming from in the context vector?
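
Here is a minimal reproduction of the operation with dummy tensors (the dimension values are placeholders I picked):

    import tensorflow as tf

    batch_size, embedding_dim = 16, 256  # placeholder values

    attention_weights = tf.random.uniform((batch_size, 64, 1))
    features = tf.random.uniform((batch_size, 64, embedding_dim))

    # * broadcasts element-wise: (batch_size, 64, 1) * (batch_size, 64, embedding_dim)
    context_vector = attention_weights * features           # (batch_size, 64, embedding_dim)
    context_vector = tf.reduce_sum(context_vector, axis=1)  # sum over the 64 locations

    print(context_vector.shape)  # (16, 256)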

Upvotes: 1

Views: 480

Answers (2)

Allohvk

Reputation: 1374

The other answer is incorrect. Let me explain why first, before I share the actual answer.

Take a look at the concerned code in the hyperlink provided. The 'hidden size' in the code refers to the dimensions of the hidden state of the decoder, NOT the hidden state(s) of the encoder, as the other answer has assumed. The multiplication in the code will yield (batch_size, embedding_dim), as the question-framer mg_nt rightly points out. The context is a weighted sum of the encoder outputs and SHOULD have the SAME dimension as the encoder outputs. Mathematically also, one should NOT get (batch_size, hidden_size).

Of course, in this case they are using attention over a CNN, so there is no encoder as such; instead the image is broken down into features. These features are collected from the second-to-last layer, and each feature is a specific component of the overall image. The hidden state from the decoder, i.e. the query, 'attends' to all these features and decides which ones are important and need to be given a higher weightage to determine the next word in the caption. The features in the above code have shape (batch_size, 64, embedding_dim), and hence the context vector, after the features are magnified or diminished by the attention weights and summed over the 64 locations, will also have shape (batch_size, embedding_dim)!

This is simply a mistake in the comments of the concerned code (the code functionality itself seems right). The shapes mentioned in the comments are incorrect. If you search the code for 'hidden_size', there is no such variable; it is mentioned only in the comments. If you further look at the declarations of the encoder and decoder, they use the same embedding size for both. So the code works, but the comments in the code are misleading and incorrect. That is all there is to it.
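
To make this concrete, here is a sketch of such an attention layer (reconstructed along the lines of the tutorial code, not copied verbatim), with shape comments that match what the tensors actually are:

    import tensorflow as tf

    class BahdanauAttention(tf.keras.Model):
        def __init__(self, units):
            super().__init__()
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)

        def call(self, features, hidden):
            # features: (batch_size, 64, embedding_dim) from the CNN encoder
            # hidden:   (batch_size, units), the decoder state acting as the query
            hidden_with_time_axis = tf.expand_dims(hidden, 1)         # (batch_size, 1, units)
            score = tf.nn.tanh(self.W1(features) +
                               self.W2(hidden_with_time_axis))        # (batch_size, 64, units)
            attention_weights = tf.nn.softmax(self.V(score), axis=1)  # (batch_size, 64, 1)
            context_vector = attention_weights * features             # (batch_size, 64, embedding_dim)
            context_vector = tf.reduce_sum(context_vector, axis=1)    # (batch_size, embedding_dim)
            return context_vector, attention_weights

Calling this with features of shape (batch_size, 64, embedding_dim) returns a context vector of shape (batch_size, embedding_dim), not (batch_size, hidden_size), which is exactly the point above.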

Upvotes: 0

thushv89

Reputation: 11333

The context vector resulting from Bahdanau attention is a weighted average of all the hidden states of the encoder. The following image from Ref shows how this is calculated. Essentially, we do the following:

  1. Compute attention weights, a (batch_size, encoder_timesteps, 1) sized tensor
  2. Multiply each hidden state (batch_size, hidden_size) element-wise with these weights, resulting in (batch_size, encoder_timesteps, hidden_size)
  3. Sum over the time dimension (a weighted average, since the weights sum to 1), resulting in (batch_size, hidden_size); see the sketch below

[Image: calculation of the context vector from the encoder hidden states and attention weights]
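
A minimal sketch of these three steps with random stand-in tensors (the sizes are arbitrary; in a real model the alignment scores come from a learned scoring function, not tf.random.normal):

    import tensorflow as tf

    batch_size, encoder_timesteps, hidden_size = 8, 10, 128  # placeholder sizes

    # Encoder hidden states: one (hidden_size,) vector per time step
    encoder_states = tf.random.normal((batch_size, encoder_timesteps, hidden_size))

    # 1. Attention weights: softmax over the time axis -> (batch_size, encoder_timesteps, 1)
    scores = tf.random.normal((batch_size, encoder_timesteps, 1))  # stand-in alignment scores
    attention_weights = tf.nn.softmax(scores, axis=1)

    # 2. Element-wise multiply -> (batch_size, encoder_timesteps, hidden_size)
    weighted = attention_weights * encoder_states

    # 3. Sum over time (a weighted average, since the weights sum to 1)
    context_vector = tf.reduce_sum(weighted, axis=1)
    print(context_vector.shape)  # (8, 128) == (batch_size, hidden_size)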

Upvotes: 1
