Reputation: 490
I have extracted the attention scores/weights matrix from the last layer and the last attention head of my BERT model. However, I am not sure how to read it. The matrix is the following one. I tried to find more information in the literature, but without success. Any insights? Since the matrix is not symmetric and each row sums to 1, I am confused. Thanks a lot!
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('Rostlab/prot_bert')
model = BertModel.from_pretrained('Rostlab/prot_bert', output_attentions=True)

inputs = tokenizer(input_text, return_tensors='pt')
outputs = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])

attention = outputs[-1]            # outputs.attentions: tuple with one tensor per layer (30 for this model)
attention = attention[-1]          # last layer: shape (batch, num_heads, seq_len, seq_len)
head_attention = attention[0, -1]  # last head of the (single) batch item
#... code to read it as a matrix with token labels
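For reference, a minimal sketch of that last step, assuming pandas is available (the DataFrame labelling below is just one option):

import pandas as pd

# Label rows (queries) and columns (keys) with the input tokens.
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
df = pd.DataFrame(head_attention.detach().numpy(), index=tokens, columns=tokens)
print(df.round(3))  # each row sums to 1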
Upvotes: 3
Views: 4032
Reputation: 1366
The key insight is that self-attention is not symmetric in nature! This may look surprising, since we might intuitively assume that Word_2 attends to Word_5 with the same strength as the reverse, i.e. Word_5's attention to Word_2. But that is not the case. The attention softmax output is NOT symmetric, as the OP rightly printed out.
Now why would we (Vaswani & co.) design such a thing, and how was it done? The 'how' is simple: self-attention is a measure of Q * K_transpose, and Q and K are different for the SAME word. Hence the relationship is not symmetric (at least as long as the query and key projections are different). To be more specific:
Q_Word2 * K_Word5_transpose is not equal to Q_Word5 * K_Word2_transpose
(Note: there is a paper where the query and key weights are shared - by that I mean the weight matrices that transform the input word into query and key are the same - so Q = K, and it naturally follows that Q*K_transpose equals K*Q_transpose. For that paper and that particular special design, the pre-softmax score matrix would be symmetric, although the row-wise softmax normalization can still make the final probabilities asymmetric.)
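A quick numerical sketch of both cases, with toy tensors (the sizes and names are illustrative, not tied to any real model):

import torch

torch.manual_seed(0)
d = 4
x = torch.randn(3, d)                    # 3 "words", one embedding each
W_q = torch.randn(d, d)                  # query projection
W_k = torch.randn(d, d)                  # separate key projection

scores = (x @ W_q) @ (x @ W_k).T         # Q * K_transpose
print(torch.allclose(scores, scores.T))  # False: asymmetric in general

tied = (x @ W_q) @ (x @ W_q).T           # tie the projections, so Q = K
print(torch.allclose(tied, tied.T))      # True: the score matrix is symmetric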
Now coming to the WHY part. Why did Vaswani and co. design it that way? It was a design choice that probably worked well empirically in their internal experiments. Queries and keys are slightly different entities: the same word may behave slightly differently when it is used as a key than when it is used as a query. This concept itself comes from information retrieval systems. Think of a Google search where you enter a query: the query is encoded and searched against a database of keys, and if a match is found, the (more detailed) value is retrieved and shown to the user. There are benefits to separating the roles of keys and queries instead of clubbing them together: each has the potential to express itself better and play its intended role more effectively. Think of a key and query as interlocking parts that literally work best together - the query opens the best-fitting lock and fetches the value inside. Simply reusing the query itself as a key may be sub-optimal.
Upvotes: 0
Reputation: 3036
The attention matrix is asymmetric because the query and key matrices differ.
At its core (leaving normalization constants and the multi-head trick aside), dot-product self-attention is computed as follows:
1. Compute the key-query affinities $e_{ij}$ (with $T$ being the sequence length, $q_i$ and $k_j$ being the query and key vectors):

$$e_{ij} = q_i^\top k_j \qquad \forall\, i, j = 1, \dots, T$$

2. Compute the attention weights $\alpha_{ij}$ from the affinities:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{T} \exp(e_{il})}$$
As you can see, you get the normalization of the affinities by summing over all keys for a given query; said differently, in the denominator you're summing the exponentiated affinities along each row (thus, probabilities sum to 1 over rows).
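In code, the two steps look like this (toy random vectors; T and d are illustrative values):

import torch

T, d = 5, 8                       # sequence length, vector dimension
q = torch.randn(T, d)             # query vectors q_i as rows
k = torch.randn(T, d)             # key vectors k_j as rows

e = q @ k.T                       # affinities e_ij = q_i . k_j
alpha = torch.softmax(e, dim=-1)  # normalize each row over the keys
print(alpha.sum(dim=-1))          # all ones: each row sums to 1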
The way you should read the attention matrix is the following: row tokens (queries) attend to column tokens (keys), and the matrix weights represent a probabilistic measure of where attention is directed when querying over the keys (i.e. on which key - and so on which token of the sentence - each query token mainly focuses). Such interaction is unidirectional (you might look at each query as looking for information somewhere in the keys, the opposite interaction being irrelevant). I found the interpretation of the attention matrix as a directed graph within this blogpost very effective.
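For instance, here is a minimal sketch (with an illustrative token list and a random stand-in attention matrix) of reading off, for each query, the key it mainly attends to:

import torch

tokens = ['[CLS]', 'M', 'K', 'L', '[SEP]']        # illustrative tokens
alpha = torch.softmax(torch.randn(5, 5), dim=-1)  # stand-in attention matrix

# Row i = query token i; the argmax over columns is the key it attends to most.
for i, j in enumerate(alpha.argmax(dim=-1)):
    print(f"{tokens[i]} mainly attends to {tokens[j]}")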
Finally, I'd also suggest the first BertViz Medium post, which distinguishes different attention patterns and according to which your example would fall into the case where attention is mostly directed to the delimiter token [CLS].
Upvotes: 5