MultiHeadAttention attention_mask [Keras, Tensorflow] example

Question

I am struggling to mask my input for the MultiHeadAttention layer. I am using the Transformer block from the Keras documentation with self-attention. I have not found any example code online so far and would appreciate it if someone could give me a code snippet.

The transformer block, taken from the Keras documentation:

from tensorflow import keras
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

The documentation describes the attention_mask argument as follows:

attention_mask: a boolean mask of shape [B, T, S], that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.
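To make the [B, T, S] shape concrete, here is a minimal NumPy sketch of one possible reading of that description. It assumes a toy batch of integer token ids in which 0 marks padding (the name token_ids and the values are made up for illustration), and it lets every query position attend only to non-padded key positions.

```python
import numpy as np

# Hypothetical toy batch: B=2 sequences of length 4, where 0 marks padding.
token_ids = np.array([[5, 7, 0, 0],
                      [3, 2, 9, 0]])

# True where a key position holds a real token, shape [B, S].
valid_keys = token_ids != 0

# Repeat the key mask for every query position to get shape [B, T, S].
# Row t of attention_mask[b] lists the keys that query t may attend to.
seq_len = token_ids.shape[1]
attention_mask = np.repeat(valid_keys[:, np.newaxis, :], seq_len, axis=1)

print(attention_mask.shape)
print(attention_mask[0].astype(int))
```

In this sketch the last two columns of attention_mask[0] are zero, meaning no query in the first sequence attends to its two padded positions.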

The only thing I could get running is a mask created outside of the layer class as a NumPy array:

mask = np.ones((observations, sequence_length, sequence_length))
mask[X[:observations,:,0]==0]=0

I then pass it in when calling the layer. The only change in the transformer block is:

def call(self, inputs, mask, training):
    attn_output = self.att(inputs, inputs, attention_mask=mask)

However, this of course does not work when a batch_size is given while fitting, and with my memory it only works for 5 observations, so it is not a usable approach. Apart from that, I don't think it masks the input properly. In general, I am quite confused about how to mask, given that the attention_mask has shape (observations, sequence_length, sequence_length) while my input has shape (observations, sequence_length, features). The input is padded with zeros, but by the time it reaches the transformer block it has already been through an embedding layer and a CNN. I have tried various ways of writing a function that creates the mask during training, using different Tensor or Keras objects, but each time I run into errors.
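For reference, this is roughly the kind of per-batch mask construction I have in mind. It is only a sketch under assumptions, not a verified solution: padding_attention_mask is a hypothetical helper, the token ids are assumed to still be available next to the embedded features, and the sequence length is assumed to be unchanged by the CNN. Because the mask is built from the current batch, it does not depend on the number of observations.

```python
import tensorflow as tf
from tensorflow.keras import layers


def padding_attention_mask(token_ids, pad_id=0):
    """Hypothetical helper: [B, T] token ids -> [B, T, T] boolean mask."""
    valid = tf.not_equal(token_ids, pad_id)  # [B, T], True for real tokens
    # A (query, key) pair is allowed only if both positions are real tokens.
    return tf.logical_and(valid[:, tf.newaxis, :], valid[:, :, tf.newaxis])


# Toy batch (made-up values): 2 sequences of length 4, 0 marks padding.
token_ids = tf.constant([[5, 7, 0, 0],
                         [3, 2, 9, 0]])
mask = padding_attention_mask(token_ids)

# Stand-in for the features that come out of the embedding layer and CNN.
x = tf.random.normal((2, 4, 8))

att = layers.MultiHeadAttention(num_heads=2, key_dim=8)
out, scores = att(x, x, attention_mask=mask, return_attention_scores=True)

print(mask.shape)    # [B, T, S]
print(out.shape)     # [B, T, features]
print(scores.shape)  # [B, num_heads, T, S]
```

Inside the transformer block, the same helper could be called in call() and its result passed as attention_mask, provided the token ids (or an equivalent [B, T] padding indicator) are handed to the block together with the features.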

I hope someone more fluent in TensorFlow/Keras will be able to provide an example, or tell me that masking is useless given my architecture. The model is performing well; however, I hoped masking could help speed up the computation. And it just bugs me that I cannot get my head around it.
