Reputation: 437
I am trying to understand Bahdanau's attention using the following tutorial: https://www.tensorflow.org/tutorials/text/nmt_with_attention
The calculation is the following:
self.attention_units = attention_units
self.W1 = Dense(self.attention_units)
self.W2 = Dense(self.attention_units)
self.V = Dense(1)
score = self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)))
I have two problems:
1) I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
Using the rules of matrix multiplication I got the following results:
a) Shape of self.W1(last_inp_dec) -> (1, hidden_units_dec) * (hidden_units_dec, attention_units) = (1, attention_units)
b) Shape of self.W2(input_enc) -> (max_len, hidden_units_enc) * (hidden_units_enc, attention_units) = (max_len, attention_units)
Then we add up the quantities from a) and b). How do we end up with dimensionality (max_len, attention_units) or (batch_size, max_len, attention_units)? How can we do the addition when one dimension differs in size (1 vs max_len)?
2) Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Is it because we want the alphas as scalars?
Upvotes: 1
Views: 163
Reputation: 1356
The shapes are slightly different from the ones you have given. It is perhaps best understood with a concrete example.
Assuming 10 units in the alignment layer, 128 embedding dimensions on the decoder, 256 dimensions on the encoder and 19 timesteps, then:
last_inp_dec and input_enc shapes would be (?,128) and (?,19,256). We now need to expand last_inp_dec over the time axis to make it (?,1,128) so that the addition is possible.
The kernel shapes for W1, W2 and V will be (128,10), (256,10) and (10,1) respectively. Notice how self.W1(last_inp_dec) works out to (?,1,10). It is broadcast over the time axis and added to self.W2(input_enc), which is (?,19,10), to give a result of shape (?,19,10). The result is fed to self.V and the output is (?,19,1), which is the shape we want - a set of 19 weights. Softmaxing this gives the attention weights.
Multiplying these attention weights with each encoder hidden state and summing over the time axis returns the context vector.
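A minimal sketch of that shape arithmetic, assuming the same illustrative sizes (the batch size of 4 and the variable names below are placeholders of mine, not the tutorial's):
import tensorflow as tf

batch, max_len = 4, 19                # placeholder batch size, 19 timesteps
enc_units, dec_units, units = 256, 128, 10

W1 = tf.keras.layers.Dense(units)     # applied to the decoder state
W2 = tf.keras.layers.Dense(units)     # applied to the encoder outputs
V = tf.keras.layers.Dense(1)          # collapses the 10 alignment units to one score per timestep

last_inp_dec = tf.random.normal((batch, dec_units))            # (?, 128)
input_enc = tf.random.normal((batch, max_len, enc_units))      # (?, 19, 256)

query = tf.expand_dims(last_inp_dec, 1)                        # (?, 1, 128)
score = V(tf.nn.tanh(W1(query) + W2(input_enc)))               # (?, 1, 10) + (?, 19, 10) -> (?, 19, 10) -> (?, 19, 1)
attention_weights = tf.nn.softmax(score, axis=1)               # (?, 19, 1)
context = tf.reduce_sum(attention_weights * input_enc, axis=1) # (?, 256)

print(score.shape, attention_weights.shape, context.shape)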
To your question as to why 'V' is needed: it is needed because Bahdanau provides the option of using 'n' units in the alignment layer (which determines the size of W1 and W2), and we need one more layer on top to massage the tensor back to the shape we want - a set of attention weights, one for each time step.
I just posted an answer at Understanding Bahdanau's Attention Linear Algebra with all the shapes of the tensors and weights involved.
Upvotes: 0
Reputation:
1) I cannot understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).
From the comments section of the code in class BahdanauAttention:
query_with_time_axis shape = (batch_size, 1, hidden size)
Note that the dimension 1 was added using tf.expand_dims to make the shape compatible with values for the addition. The added dimension of 1 gets broadcast during the addition operation. Otherwise, the incoming shape would have been (batch_size, hidden size), which would not have been compatible.
values shape = (batch_size, max_len, hidden size)
Adding the query_with_time_axis shape and the values shape gives us a shape of (batch_size, max_len, hidden size). After W1 and W2 (each with attention_units output units) are applied, the same broadcasting makes tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) come out as (batch_size, max_len, attention_units).
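As a small sketch of that broadcasting (the sizes 2, 5, 8 and 10 below are made-up placeholders, not values from the tutorial):
import tensorflow as tf

batch_size, max_len, hidden_size, attention_units = 2, 5, 8, 10
query = tf.random.normal((batch_size, hidden_size))             # decoder state: (batch_size, hidden size)
values = tf.random.normal((batch_size, max_len, hidden_size))   # encoder output: (batch_size, max_len, hidden size)

query_with_time_axis = tf.expand_dims(query, 1)                 # (batch_size, 1, hidden size)

W1 = tf.keras.layers.Dense(attention_units)
W2 = tf.keras.layers.Dense(attention_units)
summed = tf.nn.tanh(W1(query_with_time_axis) + W2(values))      # (batch_size, 1, units) broadcasts against (batch_size, max_len, units)
print(summed.shape)                                             # (2, 5, 10), i.e. (batch_size, max_len, attention_units)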
2) Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Is it because we want the alphas as scalars?
self.V is the final layer, the output of which gives us the score. The random weight initialization of the self.V layer is handled by Keras behind the scenes in the line self.V = tf.keras.layers.Dense(1).
We are not multiplying tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V.
The construct self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc))) means that the tanh activations resulting from the operation tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) form the input matrix to the single-output Dense layer represented by self.V.
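A minimal sketch of that last point, assuming arbitrary placeholder sizes (2, 5 and 10) and a random tensor standing in for the tanh activations:
import tensorflow as tf

batch_size, max_len, attention_units = 2, 5, 10
tanh_activations = tf.random.normal((batch_size, max_len, attention_units))  # stand-in for tf.nn.tanh(self.W1(...) + self.W2(...))

V = tf.keras.layers.Dense(1)                      # the self.V layer; Keras initializes its weights for us
score = V(tanh_activations)                       # (batch_size, max_len, 1): one score per timestep, not a multiplication we write ourselves
attention_weights = tf.nn.softmax(score, axis=1)  # the alphas, one scalar weight per timestep, summing to 1 over the time axis
print(score.shape, attention_weights.shape)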
Upvotes: 1