Reputation: 352
I am trying to make an attention model with Bi-LSTM using word embeddings. I came across How to add an attention mechanism in keras?, https://github.com/philipperemy/keras-attention-mechanism/blob/master/attention_lstm.py and https://github.com/keras-team/keras/issues/4962.
However, I am confused about how to implement the paper Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. So far I have:
_input = Input(shape=[max_length], dtype='int32')
# get the embedding layer
embedded = Embedding(
    input_dim=30000,
    output_dim=300,
    input_length=100,
    trainable=False,
    mask_zero=False
)(_input)
activations = Bidirectional(LSTM(20, return_sequences=True))(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
I am confused here about which equation from the paper this step corresponds to.
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
What does RepeatVector do here?
attention = RepeatVector(20)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
Now, again, I am not sure why this line is here:
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
Since I have two classes, I will have the final softmax as:
probabilities = Dense(2, activation='softmax')(sent_representation)
Upvotes: 3
Views: 6672
Reputation: 41
attention = Flatten()(attention)
transforms your tensor of attention scores into a vector (of size max_length, if max_length is your sequence length).
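As a concrete illustration, here is a tiny NumPy sketch of that reshape (toy sizes, not part of the actual model):

import numpy as np

max_length = 5
scores = np.random.rand(1, max_length, 1)  # (batch, max_length, 1): the output of Dense(1, activation='tanh')
flat = scores.reshape(1, max_length)       # (batch, max_length): what Flatten produces
print(flat.shape)                          # (1, 5)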
attention = Activation('softmax')(attention)
turns those scores into attention weights that all lie between 0 and 1 and sum to one.
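A quick numeric illustration of that normalization, again in plain NumPy:

import numpy as np

scores = np.array([1.2, -0.3, 0.5, 0.5])         # raw per-timestep scores
weights = np.exp(scores) / np.exp(scores).sum()  # softmax
print(weights)                                   # approx. [0.45, 0.10, 0.22, 0.22], all in (0, 1)
print(weights.sum())                             # 1.0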
attention = RepeatVector(20)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
RepeatVector repeats the attention-weight vector (of size max_len) once per hidden-state dimension, and Permute then swaps the two axes back, so that the weights can be multiplied element-wise with the hidden states in activations. Note that since the LSTM is Bidirectional (the forward and backward states are concatenated), activations has shape max_len * 40, so the repeat factor here should really be 2 * 20 = 40 rather than 20.
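Here is the same repeat / permute / multiply sequence spelled out in NumPy with toy sizes, so you can check that the shapes line up:

import numpy as np

max_len, hidden = 5, 4                         # toy sizes instead of 100 and 40
activations = np.random.rand(max_len, hidden)  # one hidden-state vector per timestep
weights = np.array([0.1, 0.4, 0.2, 0.2, 0.1])  # softmaxed attention weights, shape (max_len,)

tiled = np.repeat(weights[np.newaxis, :], hidden, axis=0)  # (hidden, max_len), like RepeatVector
tiled = tiled.T                                            # (max_len, hidden), like Permute([2, 1])
weighted = activations * tiled                             # element-wise product, like mode='mul'
print(weighted.shape)                                      # (5, 4)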
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
This Lambda layer sums the weighted hidden-state vectors over the time axis, producing the single sentence-representation vector that is fed to the final softmax classifier.
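Putting it all together, here is a minimal end-to-end sketch of the block written against the tf.keras API; I replaced the deprecated merge(..., mode='mul') with Multiply() and used 2 * 20 = 40 as the repeat factor (Bidirectional concatenates both directions), while the other sizes are the ones from your question:

from tensorflow.keras import backend as K
from tensorflow.keras.layers import (Activation, Bidirectional, Dense, Embedding,
                                     Flatten, Input, Lambda, LSTM, Multiply,
                                     Permute, RepeatVector)
from tensorflow.keras.models import Model

max_length, units = 100, 20
hidden = 2 * units                                    # Bidirectional concatenates both directions

_input = Input(shape=(max_length,), dtype='int32')
embedded = Embedding(input_dim=30000, output_dim=300, trainable=False)(_input)
activations = Bidirectional(LSTM(units, return_sequences=True))(embedded)  # (None, 100, 40)

attention = Dense(1, activation='tanh')(activations)  # one score per timestep  (None, 100, 1)
attention = Flatten()(attention)                      #                         (None, 100)
attention = Activation('softmax')(attention)          # normalized weights      (None, 100)
attention = RepeatVector(hidden)(attention)           #                         (None, 40, 100)
attention = Permute([2, 1])(attention)                #                         (None, 100, 40)

weighted = Multiply()([activations, attention])       # weight each hidden state (None, 100, 40)
sent_representation = Lambda(lambda x: K.sum(x, axis=-2),
                             output_shape=(hidden,))(weighted)  # sum over time -> (None, 40)

probabilities = Dense(2, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)
model.summary()

(hidden here plays the role of the undefined units variable in your snippet.)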
Hope this helped!
Upvotes: 3