Chiara

Reputation: 490

How to read a BERT attention weight matrix?

I have extracted the attention score/weight matrix from the last layer and the last attention head of my BERT model. However, I am not too sure how to read it. The matrix is the following one. I tried to find more information in the literature, but without success. Any insights? Since the matrix is not symmetric and each row sums to 1, I am confused. Thanks a lot!

[image: attention weight matrix]

  from transformers import BertModel, BertTokenizer

  tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert')
  model = BertModel.from_pretrained('Rostlab/prot_bert', output_attentions=True)

  inputs = tokenizer(input_text, return_tensors='pt')
  attention_mask = inputs['attention_mask']
  outputs = model(inputs['input_ids'], attention_mask)
  attentions = outputs[-1]                 # outputs.attentions: tuple of length 30, one tensor per layer
  layer_attention = attentions[-1]         # last layer: shape (batch, num_heads, seq_len, seq_len)
  head_attention = layer_attention[0, -1]  # last head attention matrix (seq_len x seq_len)
  # ... code to read it as a matrix with token labels

Upvotes: 3

Views: 4032

Answers (2)

Allohvk

Reputation: 1366

The key insight is that self-attention is not symmetric in nature! This may look surprising, since we might intuitively assume that Word_2 attends to Word_5 with the same strength as the opposite, i.e. Word_5's attention to Word_2. But that does not happen. The attention softmax output is NOT symmetric, as the OP rightly printed out.

Now why would we (Vaswani & co) design such a thing, and how was it done? The 'How' is simple. The self-attention score is a measure of Q*K_transpose. Q and K are different for the SAME word. Hence the relationship is not symmetric (at least as long as the query and key projections are different). To be more specific:

Q_Word2 * K_Word5_transpose is not equal to Q_Word5 * K_Word2_transpose
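To see this concretely, here is a minimal sketch with made-up random Q and K matrices (nothing from a real model), showing that the softmax of Q*K_transpose is generally not symmetric even though every row still sums to 1:

  import torch

  torch.manual_seed(0)
  T, d = 6, 8                      # sequence length and head dimension (arbitrary)
  Q = torch.randn(T, d)            # query vectors, one per word
  K = torch.randn(T, d)            # key vectors, produced by a different projection

  scores = Q @ K.T / d ** 0.5      # scaled dot-product affinities
  attn = torch.softmax(scores, dim=-1)

  print(scores[1, 4].item(), scores[4, 1].item())  # Q_Word2*K_Word5 vs Q_Word5*K_Word2: generally unequal
  print(torch.allclose(attn, attn.T))              # almost surely False: the matrix is not symmetric
  print(attn.sum(dim=-1))                          # every row sums to 1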

(Note: There is a paper where the queries and keys are tied, by which I mean the weight matrices that transform the input word into query and key are the same, so Q = K and it naturally follows that Q*K_transpose is the same as K*Q_transpose. So for that paper and that particular special design, the attention probabilities would be symmetric.)

Now coming to the WHY part. Why did Vaswani and co design it that way? It was a design choice that probably worked well empirically in their internal experiments. Queries and keys are slightly different entities. The same word may behave slightly differently when it is used as a key and when it is used as a query. This concept itself comes from information retrieval systems. Think of a Google search where you enter a query... the query is encoded and searched against a database of keys. If a match is found, the (more detailed) value is retrieved and shown to the user. There are benefits to separating the roles of the keys from the queries instead of clubbing them together. Then each has the potential to express itself better and play the role that it is supposed to play. Think of a key and query as interlocking parts that literally work best together - the query opens the best-fitting lock and fetches the value inside. Simply reusing the query itself as the key may be sub-optimal.
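As a tiny illustration of that last point (random, hypothetical projection matrices, nothing from a trained model): passing the same word embedding through separate query and key projections gives two different vectors, i.e. the word plays two different roles:

  import torch

  torch.manual_seed(0)
  d = 8
  x = torch.randn(d)           # embedding of one word
  W_q = torch.randn(d, d)      # learned query projection (hypothetical)
  W_k = torch.randn(d, d)      # learned key projection (different weights)

  q, k = x @ W_q, x @ W_k      # same word, two different representations
  print(torch.allclose(q, k))  # almost surely False: the word 'looks' different as a query vs as a key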

Upvotes: 0

amiola

Reputation: 3036

The attention matrix is asymmetric because query and key matrices differ.

At its core (leaving normalization constants and the multi-head trick aside), dot-product self-attention is computed as follows:

  1. Compute key-query affinities: e_ij = q_i · k_j, for i, j = 1, ..., T (T being the sequence length, q_i and k_j being query and key vectors)

  2. Compute attention weights from affinities: alpha_ij = exp(e_ij) / Σ_{j'=1}^{T} exp(e_ij')

As you can see, you get the normalization of the affinities by summing over all keys given a query; said differently, in the denominator you're summing affinities by row (thus, probabilities sum to 1 over rows).
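A rough sketch of these two steps on random vectors (plain dot-product attention, ignoring the scaling constant and the multi-head machinery) could look like this:

  import numpy as np

  rng = np.random.default_rng(0)
  T, d = 5, 16                        # sequence length and vector dimension (arbitrary)
  q = rng.normal(size=(T, d))         # query vectors q_i
  k = rng.normal(size=(T, d))         # key vectors k_j

  e = q @ k.T                         # step 1: affinities e_ij = q_i . k_j
  alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # step 2: softmax over keys, row by row

  print(alpha.sum(axis=1))            # each row sums to 1 (probabilities over keys, given a query)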

The way you should read the attention matrix is the following: row tokens (queries) attend to column tokens (keys), and the matrix weights represent a way to probabilistically measure where attention is directed when querying over the keys (i.e. which key - and so which token of the sentence - each query token mainly focuses on). Such interaction is unidirectional (you might look at each query as looking for information somewhere in the keys, the opposite interaction being irrelevant). I found the interpretation of the attention matrix as a directed graph in this blog post very effective.
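For instance, assuming head_attention is the (seq_len x seq_len) matrix extracted in your snippet and tokenizer/inputs are the objects from your code, one (purely illustrative) way to read it row by row could be:

  # map each row index to its token, then report the most-attended key per query
  tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
  for i, query_tok in enumerate(tokens):
      j = head_attention[i].argmax().item()
      print(f"{query_tok:>8} -> {tokens[j]:<8} (weight {head_attention[i, j].item():.2f})")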

Finally, I'd also suggest the first BertViz Medium post, which distinguishes different attention patterns and according to which your example would fall into the case where attention is mostly directed to the delimiter token [CLS].
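If you go the BertViz route, a minimal sketch (assuming the bertviz package is installed, the model was loaded with output_attentions=True, and attentions/inputs come from your snippet; it renders inside a Jupyter notebook) might be:

  from bertviz import head_view

  tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
  head_view(attentions, tokens)   # interactive view of every layer's and head's attention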

Upvotes: 5
