Reputation: 1230
I'm currently using this code that I got from a discussion on GitHub. Here's the code of the attention mechanism:
_input = Input(shape=[max_length], dtype='int32')
# get the embedding layer
embedded = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_size,
    input_length=max_length,
    trainable=False,
    mask_zero=False
)(_input)
activations = LSTM(units, return_sequences=True)(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
probabilities = Dense(3, activation='softmax')(sent_representation)
Is this the correct way to do it? I was sort of expecting a TimeDistributed layer, since the attention mechanism is applied at every time step of the RNN. I need someone to confirm that this implementation (the code) is a correct implementation of the attention mechanism. Thank you.
Upvotes: 23
Views: 46337
Reputation: 1374
While many good alternatives have been given, I have tried to modify the code you shared to make it work. I have also answered your other query, which has not been addressed so far:
Q1. Is this the correct way to do it? The attention layer itself looks good. No changes needed. The way you have used the output of the attention layer can be slightly simplified and modified to incorporate some recent framework upgrades.
sent_representation = Multiply()([activations, attention])  # Multiply is imported from keras.layers
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
You are now good to go!
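Putting it together, here is a minimal sketch of the whole block with the current Keras layer API (Multiply and Lambda in place of the removed merge; vocab_size, embedding_size, max_length and units are assumed to be defined as in your snippet):
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Flatten, Activation, RepeatVector, Permute, Multiply, Lambda
import keras.backend as K

_input = Input(shape=[max_length], dtype='int32')
embedded = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                     input_length=max_length, trainable=False, mask_zero=False)(_input)
activations = LSTM(units, return_sequences=True)(embedded)   # (batch, max_length, units)
# one scalar score per timestep, normalised over time with softmax
attention = Dense(1, activation='tanh')(activations)         # (batch, max_length, 1)
attention = Flatten()(attention)                              # (batch, max_length)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)                    # (batch, units, max_length)
attention = Permute([2, 1])(attention)                        # (batch, max_length, units)
# weighted sum of the LSTM states over the time axis
sent_representation = Multiply()([activations, attention])
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)  # (batch, units)
probabilities = Dense(3, activation='softmax')(sent_representation)
model = Model(inputs=_input, outputs=probabilities)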
Q2. I was sort of expecting a TimeDistributed layer, since the attention mechanism is applied at every time step of the RNN
No, you don't need a TimeDistributed layer; otherwise the weights would be shared across timesteps, which is not what you want.
You can refer to https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e for other specific details.
Upvotes: 1
Reputation: 3095
I think you can try the following code to add a Keras self-attention mechanism to an LSTM network:
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Flatten, Dense
from keras.optimizers import Adam
from keras_self_attention import SeqSelfAttention

inputs = Input(shape=(MAX_SEQUENCE_LENGTH,))
# pre-trained, frozen embedding weights
embedding = Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
                      input_length=MAX_SEQUENCE_LENGTH, trainable=False)(inputs)
# return_sequences=True so the attention layer sees every timestep
lstm = LSTM(num_lstm, return_sequences=True)(embedding)
attn = SeqSelfAttention(attention_activation='sigmoid')(lstm)
flat = Flatten()(attn)
dense = Dense(32, activation='relu')(flat)
# sigmoid outputs + binary_crossentropy treat the 3 classes as independent labels
outputs = Dense(3, activation='sigmoid')(dense)
model = Model(inputs=[inputs], outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32,
          validation_data=(X_val, y_val), shuffle=True)
Upvotes: 0
Reputation: 4170
Recently I was working on applying an attention mechanism to a dense layer, and here is one sample implementation:
from keras.models import Model
from keras.layers import Input, Dense, multiply
from keras import regularizers

def build_model():
    input_dims = train_data_X.shape[1]
    inputs = Input(shape=(input_dims,))
    dense1800 = Dense(1800, activation='relu', kernel_regularizer=regularizers.l2(0.01))(inputs)
    # one sigmoid "importance" weight per unit of the previous dense layer
    attention_probs = Dense(1800, activation='sigmoid', name='attention_probs')(dense1800)
    attention_mul = multiply([dense1800, attention_probs], name='attention_mul')
    dense7 = Dense(7, kernel_regularizer=regularizers.l2(0.01), activation='softmax')(attention_mul)
    model = Model(inputs=[inputs], outputs=dense7)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
model.summary()
model.fit(train_data_X, train_data_Y_, epochs=20, validation_split=0.2,
          batch_size=600, shuffle=True, verbose=1)
Upvotes: 3
Reputation: 1250
The attention mechanism pays attention to different parts of the sentence:
activations = LSTM(units, return_sequences=True)(embedded)
and it determines the contribution of each hidden state of that sentence by:
attention = Dense(1, activation='tanh')(activations)
attention = Activation('softmax')(attention)
And finally it pays attention to the different states:
sent_representation = merge([activations, attention], mode='mul')
I don't quite understand this part: sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)
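For what it's worth, here is a tiny NumPy trace (hypothetical numbers: one sample, 3 timesteps, 2 units) of what that Lambda computes:
import numpy as np

# element-wise product of activations and the repeated attention weights,
# shape (batch=1, timesteps=3, units=2)
weighted = np.array([[[0.2, 0.4],
                      [0.1, 0.1],
                      [0.3, 0.6]]])
# summing over axis=-2 (the time axis) collapses the timesteps,
# leaving one weighted-sum vector of length `units` per sample
sent_representation = weighted.sum(axis=-2)   # shape (1, 2) -> [[0.6, 1.1]]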
To understand more, you can refer to this and this; this one also gives a good implementation, so see if you can understand more on your own.
Upvotes: 2
Reputation: 1821
If you want attention along the time dimension, then this part of your code seems correct to me:
activations = LSTM(units, return_sequences=True)(embedded)
# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
sent_representation = merge([activations, attention], mode='mul')
You've worked out the attention vector of shape (batch_size, max_length):
attention = Activation('softmax')(attention)
I've never seen this code before, so I can't say if this one is actually correct or not:
K.sum(xin, axis=-2)
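For reference, a shape trace of the block above (assuming activations has shape (batch_size, max_length, units)):
# activations:                 (batch_size, max_length, units)
# Dense(1, activation='tanh'): (batch_size, max_length, 1)
# Flatten():                   (batch_size, max_length)
# Activation('softmax'):       (batch_size, max_length)   <- attention weights over time
# RepeatVector(units):         (batch_size, units, max_length)
# Permute([2, 1]):             (batch_size, max_length, units)
# merge mode='mul':            (batch_size, max_length, units)
# K.sum(xin, axis=-2):         (batch_size, units)        <- sums over the max_length (time) axis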
Further reading (you might have a look):
Upvotes: 19