Aryo Pradipta Gema

Reputation: 1230

How to add an attention mechanism in keras?

I'm currently using this code that I got from a discussion on GitHub. Here's the code of the attention mechanism:

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

Is this the correct way to do it? I was sort of expecting the existence of a TimeDistributed layer, since the attention mechanism is distributed over every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of the attention mechanism. Thank you.

Upvotes: 23

Views: 46337

Answers (5)

Allohvk

Reputation: 1374

While many good alternatives have been given, I have tried to modify the code YOU shared to make it work. I have also answered your other query, which has not been addressed so far:

Q1. Is this the correct way to do it? The attention layer itself looks good. No changes needed. The way you have used the output of the attention layer can be slightly simplified and modified to incorporate some recent framework upgrades.

    # Multiply() replaces the deprecated merge(..., mode='mul') API
    sent_representation = merge.Multiply()([activations, attention])
    # sum the weighted hidden states over the time axis to get one vector per sentence
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

You are now good to go!

Q2. I was sort of expecting the existence of a TimeDistributed layer, since the attention mechanism is distributed over every time step of the RNN

No, you don't need a TimeDistributed layer; otherwise the weights would be shared across timesteps, which is not what you want.

You can refer to: https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e for other specific details
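
For completeness, here is a minimal end-to-end sketch of the question's model with the two lines above swapped in. This is only a sketch assuming tf.keras 2.x; the hyperparameter values (max_length, vocab_size, embedding_size, units) are placeholders, not values from the question:

    from tensorflow.keras import backend as K
    from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense, Flatten,
                                         Activation, RepeatVector, Permute,
                                         Multiply, Lambda)
    from tensorflow.keras.models import Model

    # placeholder hyperparameters -- replace with your own
    max_length, vocab_size, embedding_size, units = 100, 10000, 128, 64

    _input = Input(shape=(max_length,), dtype='int32')
    # trainable=False mirrors the question; normally you would pass pretrained weights here
    embedded = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                         input_length=max_length, trainable=False,
                         mask_zero=False)(_input)
    activations = LSTM(units, return_sequences=True)(embedded)   # (batch, max_length, units)

    # compute importance for each step
    attention = Dense(1, activation='tanh')(activations)         # (batch, max_length, 1)
    attention = Flatten()(attention)                              # (batch, max_length)
    attention = Activation('softmax')(attention)                  # weights over timesteps
    attention = RepeatVector(units)(attention)                    # (batch, units, max_length)
    attention = Permute([2, 1])(attention)                        # (batch, max_length, units)

    # Multiply() is the current API; older standalone Keras exposed it as merge.Multiply()
    sent_representation = Multiply()([activations, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)  # sum over time

    probabilities = Dense(3, activation='softmax')(sent_representation)

    model = Model(inputs=_input, outputs=probabilities)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.summary()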

Upvotes: 1

Ashok Kumar Jayaraman

Reputation: 3095

I think you can try the following code to add a Keras self-attention mechanism to an LSTM network:

    from keras.models import Model
    from keras.layers import Input, Embedding, LSTM, Flatten, Dense
    from keras.optimizers import Adam
    from keras_self_attention import SeqSelfAttention

    inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
    embedding = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
                          input_length=MAX_SEQUENCE_LENGTH, trainable=False)(inputs)
    # return_sequences=True so the attention layer sees every timestep
    lstm = LSTM(num_lstm, return_sequences=True)(embedding)
    attn = SeqSelfAttention(attention_activation='sigmoid')(lstm)
    flat = Flatten()(attn)
    dense = Dense(32, activation='relu')(flat)
    outputs = Dense(3, activation='sigmoid')(dense)
    model = Model(inputs=[inputs], outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_data=(X_val, y_val), shuffle=True)
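
Note that SeqSelfAttention is not part of Keras itself; it comes from the third-party keras-self-attention package, so you would typically install it first with pip install keras-self-attention.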

Upvotes: 0

PixelPioneer

Reputation: 4170

Recently I was working on applying an attention mechanism to a dense layer, and here is one sample implementation:

from keras.models import Model
from keras.layers import Input, Dense, multiply
from keras import regularizers

def build_model():
  input_dims = train_data_X.shape[1]
  inputs = Input(shape=(input_dims,))
  dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
  # one sigmoid gate per unit, used as the attention weights
  attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
  attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
  dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
  model = Model(inputs=[inputs], outputs=dense7)
  model.compile(optimizer='adam',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
  return model

model = build_model()
model.summary()

model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2,
          batch_size=600, shuffle=True, verbose=1)


Upvotes: 3

MJeremy

Reputation: 1250

The attention mechanism pays attention to different parts of the sentence:

activations = LSTM(units, return_sequences=True)(embedded)

And it determines the contribution of each hidden state of that sentence by:

  1. Computing a score (aggregation) for each hidden state: attention = Dense(1, activation='tanh')(activations)
  2. Assigning weights to the different states: attention = Activation('softmax')(attention)

And finally it pays attention to the different states:

sent_representation = merge([activations, attention], mode='mul')
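
As a rough numeric illustration of those two steps plus the final weighting (plain NumPy with hypothetical numbers, not the original Keras code):

    import numpy as np

    # hypothetical sizes: 1 sentence, 4 timesteps, 3 LSTM units
    activations = np.arange(12, dtype=float).reshape(1, 4, 3)  # hidden state per timestep
    scores = np.array([[2.0, 0.5, 0.1, 1.0]])                  # step 1: one score per timestep (Dense(1) + Flatten)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # step 2: softmax over timesteps
    weighted = activations * weights[:, :, None]               # each hidden state scaled by its weight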

I don't quite understand this part: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

To understand more, you can refer to this and this; this one also gives a good implementation. See if you can understand more on your own.

Upvotes: 2

Philippe Remy

Reputation: 1821

If you want attention along the time dimension, then this part of your code seems correct to me:

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')

You've worked out the attention vector of shape (batch_size, max_length):

attention = Activation('softmax')(attention)
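
As a quick check of what happens to that vector afterwards, a tiny sketch with hypothetical sizes (max_length=50, units=128), assuming tf.keras:

    from tensorflow.keras import layers, Input

    attention = Input(shape=(50,))                   # (batch_size, 50)  the softmax weights
    repeated = layers.RepeatVector(128)(attention)   # (batch_size, 128, 50)
    aligned = layers.Permute([2, 1])(repeated)       # (batch_size, 50, 128), ready to multiply with the LSTM output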

I've never seen this code before, so I can't say if this one is actually correct or not:

K.sum(xin, axis=-2)

Further reading (you might have a look):

Upvotes: 19
