Reputation: 437
I am trying to understand Bahdanau's attention using the following tutorial: https://www.tensorflow.org/tutorials/text/nmt_with_attention
The calculation is the following:
self.attention_units = attention_units
self.W1 = Dense(self.attention_units)
self.W2 = Dense(self.attention_units)
self.V = Dense(1)
score = self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)))
I have two problems:
1) I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
Using the rules of matrix multiplication I got the following results:
a) Shape of self.W1(last_inp_dec) -> (1, hidden_units_dec) * (hidden_units_dec, attention_units) = (1, attention_units)
b) Shape of self.W2(input_enc) -> (max_len, hidden_units_enc) * (hidden_units_enc, attention_units) = (max_len, attention_units)
Then we add up the quantities from a) and b). How do we end up with dimensionality (max_len, attention_units) or (batch_size, max_len, attention_units)? How can we do the addition when one dimension differs in size (1 vs max_len)?
2) Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Is it because we want the alphas as scalars?
Upvotes: 1
Views: 163
Reputation: 1356
The shapes are slightly different from the ones you have given. It is perhaps best understood with a concrete example.
Assuming 10 units in the alignment layer, 128 embedding dimensions on the decoder, 256 dimensions on the encoder and 19 timesteps, then:
last_inp_dec and input_enc shapes would be (?,128) and (?,19,256). We now need to expand last_inp_dec over the time axis to make it (?,1,128) so that the addition is possible.
The kernel shapes for W1, W2 and V will be (128,10), (256,10) and (10,1) respectively. Notice how self.W1(last_inp_dec) works out to (?,1,10). It is broadcast over the time axis and added to self.W2(input_enc), which is (?,19,10), to give a result of shape (?,19,10). The result is fed to self.V and the output is (?,19,1), which is the shape we want - a set of 19 weights. Softmaxing this gives the attention weights.
Multiplying these attention weights with each encoder hidden state and summing over the time axis returns the context vector.
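A minimal sketch of that shape arithmetic, assuming the same illustrative sizes (the batch size of 4 and the variable names below are placeholders of mine, not the tutorial's):
import tensorflow as tf

batch, max_len = 4, 19                # placeholder batch size, 19 timesteps
enc_units, dec_units, units = 256, 128, 10

W1 = tf.keras.layers.Dense(units)     # applied to the decoder state
W2 = tf.keras.layers.Dense(units)     # applied to the encoder outputs
V = tf.keras.layers.Dense(1)          # collapses the 10 alignment units to one score per timestep

last_inp_dec = tf.random.normal((batch, dec_units))            # (?, 128)
input_enc = tf.random.normal((batch, max_len, enc_units))      # (?, 19, 256)

query = tf.expand_dims(last_inp_dec, 1)                        # (?, 1, 128)
score = V(tf.nn.tanh(W1(query) + W2(input_enc)))               # (?, 1, 10) + (?, 19, 10) -> (?, 19, 10) -> (?, 19, 1)
attention_weights = tf.nn.softmax(score, axis=1)               # (?, 19, 1)
context = tf.reduce_sum(attention_weights * input_enc, axis=1) # (?, 256)

print(score.shape, attention_weights.shape, context.shape)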
To your question as to why 'V' is needed: it is needed because Bahdanau provides the option of using 'n' units in the alignment layer (which determines the size of W1 and W2), and we need one more layer on top to massage the tensor back to the shape we want - a set of attention weights, one for each time step.
I just posted an answer at Understanding Bahdanau's Attention Linear Algebra with all the shapes of the tensors and weights involved.
Upvotes: 0
Reputation:
1) I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
From the comments section of the code in class BahdanauAttention:
query_with_time_axis shape = (batch_size, 1, hidden size)
Note that the dimension 1 was added using tf.expand_dims to make the shape compatible with values for the addition. The added dimension of 1 gets broadcast during the addition operation. Otherwise, the incoming shape would have been (batch_size, hidden size), which would not have been compatible.
values shape = (batch_size, max_len, hidden size)
Adding the query_with_time_axis shape and the values shape gives us a shape of (batch_size, max_len, hidden size). After W1 and W2 (each with attention_units output units) are applied, the same broadcasting makes tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) come out as (batch_size, max_len, attention_units).
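As a small sketch of that broadcasting (the sizes 2, 5, 8 and 10 below are made-up placeholders, not values from the tutorial):
import tensorflow as tf

batch_size, max_len, hidden_size, attention_units = 2, 5, 8, 10
query = tf.random.normal((batch_size, hidden_size))             # decoder state: (batch_size, hidden size)
values = tf.random.normal((batch_size, max_len, hidden_size))   # encoder output: (batch_size, max_len, hidden size)

query_with_time_axis = tf.expand_dims(query, 1)                 # (batch_size, 1, hidden size)

W1 = tf.keras.layers.Dense(attention_units)
W2 = tf.keras.layers.Dense(attention_units)
summed = tf.nn.tanh(W1(query_with_time_axis) + W2(values))      # (batch_size, 1, units) broadcasts against (batch_size, max_len, units)
print(summed.shape)                                             # (2, 5, 10), i.e. (batch_size, max_len, attention_units)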
2) Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Is it because we want the alphas as scalars?
self.V is the final layer, the output of which gives us the score. The random weight initialization of the self.V layer is handled by Keras behind the scenes in the line self.V = tf.keras.layers.Dense(1).
We are not multiplying tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V.
The construct self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc))) means that the tanh activations resulting from the operation tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) form the input matrix to the single-output Dense layer represented by self.V.
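A minimal sketch of that last point, assuming arbitrary placeholder sizes (2, 5 and 10) and a random tensor standing in for the tanh activations:
import tensorflow as tf

batch_size, max_len, attention_units = 2, 5, 10
tanh_activations = tf.random.normal((batch_size, max_len, attention_units))  # stand-in for tf.nn.tanh(self.W1(...) + self.W2(...))

V = tf.keras.layers.Dense(1)                      # the self.V layer; Keras initializes its weights for us
score = V(tanh_activations)                       # (batch_size, max_len, 1): one score per timestep, not a multiplication we write ourselves
attention_weights = tf.nn.softmax(score, axis=1)  # the alphas, one scalar weight per timestep, summing to 1 over the time axis
print(score.shape, attention_weights.shape)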
Upvotes: 1