sk-19

Attention Mechanism Scores are the same

Problem Statement:

I am currently working on Aspect-Based Sentiment Analysis, where the objective is to analyze changing sentiment trends within a sentence by employing temporal windows. Ultimately, I aim to develop a contrastive learning model. To achieve this, I am utilizing a pre-trained RoBERTa transformer model along with attention mechanisms.

For instance, given the sentence: "Battery life is good, but camera is very bad"

With the aspects being "Battery life" and "camera," the dataset is structured as follows:

Index  Sentence                                        Aspect        Polarity
1      "Battery life is good, but camera is very bad"  Battery life  Positive
2      "Battery life is good, but camera is very bad"  Camera        Negative

To obtain sentence embeddings, I use a pre-trained RoBERTa transformer model. I then pass these embeddings through an attention mechanism to obtain 'aspect-aware' embeddings. In other words, I want the attention mechanism to weight the words in the sentence according to the aspect, so that each word's relative importance differs by aspect: the word scores for the index 1 row should differ from the word scores for the index 2 row, with each set optimized for its own aspect.

Tokenizer and model used:

from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Attention module:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadAttentionWithAspect(nn.Module):
    def __init__(self, input_dim, d_model, aspect_embedding_dim, num_heads):
        super().__init__()
        self.input_dim = input_dim
        self.d_model = d_model
        self.num_heads = num_heads
        self.aspect_embedding_dim = aspect_embedding_dim

        self.qkv_layer = nn.Linear(d_model, 3 * d_model)  # Changed input_dim to d_model
        self.aspect_linear = nn.Linear(aspect_embedding_dim, d_model)  # Linear layer for aspect embeddings
        self.linear_layer = nn.Linear(d_model, d_model)  # Output projection

    def forward(self, x, aspect_embeddings, mask=None):
        batch_size, sequence_length, input_dim = x.size()
        # Project the aspect embeddings and tile them along the sequence axis
        aspect_embeddings_expanded = self.aspect_linear(aspect_embeddings).repeat(1, sequence_length, 1)

        qkv = self.qkv_layer(x)
        q, k, v = qkv.chunk(3, dim=-1)  # Note: q is not used below

        # Per-head size; computed but unused, since the heads are never split
        output_features = self.d_model // self.num_heads

        # Applies the aspect_linear weight a second time (without its bias)
        aspect_embeddings_projected = F.linear(aspect_embeddings_expanded, self.aspect_linear.weight)
        aspect_weights = torch.sigmoid(aspect_embeddings_projected)  # Sigmoid activation for gating
        x_with_aspect = x * aspect_weights

        k_concat = k + aspect_embeddings_projected  # Element-wise addition
        v_concat = v + aspect_embeddings_projected

        # The gated input, not q, is passed as the query
        values, attention = scaled_dot_product(x_with_aspect, k_concat, v_concat, mask)

        values = F.dropout(values, p=0.1, training=self.training)

        output = self.linear_layer(values)
        return output, attention
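
The `scaled_dot_product` helper called in `forward` is not shown in this post. A minimal sketch of a common definition follows. It assumes a boolean mask that is True where attention is allowed, which may not match the helper actually used. The tensors in the usage lines are random placeholders.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product(q, k, v, mask=None):
    """Assumed helper: softmax(q k^T / sqrt(d_k)) v with an optional boolean mask."""
    d_k = q.size(-1)
    # Similarity of every query position with every key position
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Assumption: mask is True where attention is allowed
        scores = scores.masked_fill(~mask, float("-inf"))
    attention = F.softmax(scores, dim=-1)  # each row sums to 1
    values = torch.matmul(attention, v)
    return values, attention


torch.manual_seed(0)
q = torch.randn(1, 5, 8)
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)
# Hypothetical padding mask: the last two key positions are padding
mask = torch.tensor([[[True, True, True, False, False]]])
values, attention = scaled_dot_product(q, k, v, mask)
print(attention.shape)  # torch.Size([1, 5, 5])
```

With a mask like this, the masked key positions receive zero weight and the remaining weights in each row still sum to 1.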

This class is called like this:

input_dim = 768  # Dimensionality of input embeddings
d_model = 768  # Dimensionality of the model
num_heads = 8  # Number of attention heads

batch_size = 32
sequence_length = 398
aspect_embedding_dim = 768
x = sentence_embeddings

# Initialize encoder self-attention module
encoder_self_attention = MultiheadAttentionWithAspect(input_dim, d_model, aspect_embedding_dim, num_heads)

# Forward pass
output, attention = encoder_self_attention(x, aspect_embeddings, mask)
print("Attention:", attention)

The issue: for the following sentence and aspect, I retrieve the sentence embeddings and aspect embeddings, but when I then compute the attention scores for the sentence, all the values are nearly identical:

sentence = "The food at this restaurant is delicious."
aspect = "food"
max_seq_length = 400  # longest sentence in df

sentence_embeddings, aspect_embeddings, input_ids = tokenize_and_contextualize(sentence, aspect, tokenizer, model, max_seq_length)

# Initialize encoder self-attention module
encoder_self_attention = MultiheadAttentionWithAspect(input_dim, d_model, aspect_embedding_dim, num_heads)

# Forward pass
output, attention = encoder_self_attention(x, aspect_embeddings, mask)
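
The `mask` passed above is not defined in the snippets shown. Since a short sentence padded to `max_seq_length` is mostly padding, one possible way to build a padding mask from `input_ids` is sketched below. The pad id of 1 is an assumption for `roberta-base` (check `tokenizer.pad_token_id`), the token ids are made up, and the True-means-attend convention is hypothetical.

```python
import torch

PAD_ID = 1  # assumption: pad token id for roberta-base; verify with tokenizer.pad_token_id

# Made-up input_ids for one short sentence padded to length 8
input_ids = torch.tensor([[0, 11, 12, 13, 2, 1, 1, 1]])

# True for real tokens, False for padding: shape (batch, seq_len)
key_is_real = input_ids != PAD_ID

# Add a query axis so the mask broadcasts over all query positions:
# shape (batch, 1, seq_len)
mask = key_is_real.unsqueeze(1)

print(mask.shape)       # torch.Size([1, 1, 8])
print(int(mask.sum()))  # 5
```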

Attention Scores when printed out:


Attention Scores: tensor([[[0.0026, 0.0025, 0.0026,  ..., 0.0025, 0.0025, 0.0025],
         [0.0026, 0.0024, 0.0025,  ..., 0.0025, 0.0025, 0.0025],
         [0.0026, 0.0024, 0.0025,  ..., 0.0025, 0.0025, 0.0025],
         ...,
         [0.0027, 0.0026, 0.0026,  ..., 0.0025, 0.0025, 0.0025],
         [0.0027, 0.0026, 0.0026,  ..., 0.0025, 0.0025, 0.0025],
         [0.0027, 0.0026, 0.0026,  ..., 0.0025, 0.0025, 0.0025]]],
       grad_fn=<SoftmaxBackward0>)
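
Every printed value is close to 0.0025, which equals 1/400. That would be consistent with a nearly uniform softmax over a padded length of 400. The plain-Python sketch below illustrates the arithmetic only. The logits are invented for the example and are not taken from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


n = 400  # max_seq_length from the question

# Identical logits give exactly uniform weights of 1 / n
uniform = softmax([0.0] * n)
print(round(uniform[0], 4))  # 0.0025

# Logits that differ only slightly (here by at most 0.2) stay close to 1 / n
near_flat = softmax([0.1 * math.sin(i) for i in range(n)])
print(round(min(near_flat), 4), round(max(near_flat), 4))
```

If the logits before the softmax barely differ across positions, the weights cannot differ much either, whatever the aspect.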

Output:

Output: tensor([[[-0.1860,  0.0446, -0.0739,  ..., -0.0274,  0.0770,  0.0683],
         [-0.1819,  0.0725,  0.0731,  ..., -0.0391,  0.0886,  0.0440],
         [-0.2937, -0.0031, -0.0294,  ...,  0.1002,  0.1164,  0.0995],
         ...,
         [-0.1288,  0.1084, -0.1185,  ...,  0.2159,  0.1058,  0.1133],
         [-0.2388,  0.1785, -0.0160,  ...,  0.0545,  0.1429,  0.0658],
         [-0.2878, -0.0587, -0.0592,  ...,  0.0542,  0.1209,  0.0370]]],
       grad_fn=<ViewBackward0>)

I would appreciate help with any of the following:

  1. Is the attention mechanism architecture sound?

  2. Why are all the attention scores nearly the same?

  3. Is there a better way to inject the 'aspect' information into the sentence embedding than concatenating them?

  4. What improvements would you suggest to the current model, keeping in mind that the goal is to learn multiple aspects in a sentence?


Thank you for your help.

I expect words related to the aspect, and the aspect itself, to get high scores.
