amy

Reputation: 352

Bi-LSTM Attention model in Keras

I am trying to build an attention model with a Bi-LSTM on top of word embeddings. I came across How to add an attention mechanism in keras?, https://github.com/philipperemy/keras-attention-mechanism/blob/master/attention_lstm.py and https://github.com/keras-team/keras/issues/4962.

However, I am still confused about how this maps onto the model described in Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. So far I have:

from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from keras.layers import Activation, Flatten, RepeatVector, Permute, Lambda, multiply
from keras import backend as K

max_length = 100  # padded sequence length

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=30000,
        output_dim=300,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = Bidirectional(LSTM(20, return_sequences=True))(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)

I am confused here about which equation from the paper each of these lines corresponds to.

attention = Flatten()(attention)
attention = Activation('softmax')(attention)

What will RepeatVector do?

attention = RepeatVector(40)(attention)  # 40 = 2 * 20, since the Bi-LSTM concatenates forward and backward states
attention = Permute([2, 1])(attention)


sent_representation = multiply([activations, attention])  # element-wise product; older Keras wrote this as merge([...], mode='mul')

Now, again I am not sure why this line is here.

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(40,))(sent_representation)

Since I have two classes, I will have the final softmax as:

probabilities = Dense(2, activation='softmax')(sent_representation)
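
For completeness, here is how I wire the pieces above into a trainable model; the Adam optimizer and categorical cross-entropy loss are just my own choices, not something taken from the paper:

from keras.models import Model

model = Model(inputs=_input, outputs=probabilities)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()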

Upvotes: 3

Views: 6672

Answers (1)

valbarriere

Reputation: 41

attention = Flatten()(attention)  

turns your tensor of attention weights into a vector (of size max_length, if your sequence length is max_length).

attention = Activation('softmax')(attention)

normalizes the attention weights so that they all lie between 0 and 1 and sum to one.
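
If you want to check the normalization numerically, here is a quick numpy sketch (the scores are made up, just to illustrate):

import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])                                   # (batch=1, max_length=3) after Flatten
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over the timesteps
print(weights)        # approx. [[0.659 0.242 0.099]] -> every weight is between 0 and 1
print(weights.sum())  # 1.0 -> the weights sum to one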

attention = RepeatVector(40)(attention)
attention = Permute([2, 1])(attention)


sent_representation = multiply([activations, attention])

RepeatVector repeats the attention weight vector (which has size max_len) once per hidden dimension, and Permute swaps the two axes so that the weights can be multiplied element-wise with the hidden states in activations. Note that a Bidirectional LSTM with 20 units concatenates the forward and backward states, so the hidden dimension is 2 * 20 = 40 and activations has shape max_len * 40 (hence RepeatVector(40) above).
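
Here is a small numpy mock-up of those three steps with toy sizes (the real values in the question are max_len = 100 and hidden = 40); the variable names are only for illustration:

import numpy as np

max_len, hidden = 4, 6
weights = np.random.rand(max_len)                  # attention weights after the softmax
weights /= weights.sum()
activations = np.random.rand(max_len, hidden)      # Bi-LSTM outputs, one row per timestep

repeated = np.tile(weights, (hidden, 1))           # what RepeatVector(hidden) does: shape (hidden, max_len)
permuted = repeated.T                              # what Permute([2, 1]) does: shape (max_len, hidden)
weighted = activations * permuted                  # element-wise multiplication: shape (max_len, hidden)
print(weighted.shape)                              # (4, 6)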

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(40,))(sent_representation)

This Lambda layer sums the weighted hidden-state vectors over the timesteps, giving the single sentence vector that is fed to the final classification layer.
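
You can check with the Keras backend that this is exactly the weighted sum of the hidden states, i.e. the sum over t of alpha_t * h_t (the numbers below are made up):

from keras import backend as K
import numpy as np

alphas = np.array([[0.2, 0.5, 0.3]])                        # (batch=1, max_len=3) attention weights
states = np.arange(12, dtype='float32').reshape(1, 3, 4)    # (batch=1, max_len=3, hidden=4) Bi-LSTM outputs
weighted = states * alphas[:, :, None]                      # the element-wise multiplication above
print(K.eval(K.sum(K.constant(weighted), axis=-2)))         # (1, 4) sentence representation
print(weighted.sum(axis=1))                                 # same numbers with plain numpy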

Hope this helped!

Upvotes: 3
