oooliverrr

Reputation: 98

Inconsistent vector representation using transformers BertModel and BertTokenizer

I have a BertTokenizer (tokenizer) and a BertModel (model) from the transformers library. I have pre-trained the model from scratch on a few Wikipedia articles, just to test how it works.
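
For completeness, the setup looks roughly like this (a minimal sketch; "./my-bert-checkpoint" is just a placeholder for the directory where I saved my pre-trained model and tokenizer):

import torch
from transformers import BertTokenizer, BertModel

# Placeholder path to the locally pre-trained checkpoint
tokenizer = BertTokenizer.from_pretrained("./my-bert-checkpoint")
model = BertModel.from_pretrained("./my-bert-checkpoint")
model.eval()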

Once the model is pre-trained, I want to extract a vector representation for a given sentence from one of the layers. For that, I average the per-token hidden vectors of the last layer (each of size 768). I do this as follows (line is a single string):

padded_sequence = tokenizer(line, padding=True)

indexed_tokens = padded_sequence['input_ids']
attention_mask = padded_sequence["attention_mask"]

# Add a batch dimension of 1
tokens_tensor = torch.tensor([indexed_tokens])
attention_mask_tensor = torch.tensor([attention_mask])

# outputs[0] is the last hidden state, shape (1, seq_len, 768)
outputs = model(tokens_tensor, attention_mask_tensor)
hidden_states = outputs[0]

# Average over the token dimension to get a single 768-sized vector
line_vectorized = hidden_states[0].data.numpy().mean(axis=0)

So far so good. I can do this for every sentence individually. But now I want to do it in batch, i.e. I have a bunch of sentences and, instead of iterating over each sentence, I send the appropriate tensor representations to get all vectors at once. I do this as follows (lines is a list of strings):

padded_sequences = tokenizer(lines, padding=True)

indexed_tokens_list = padded_sequences['input_ids']
attention_mask_list = padded_sequences["attention_mask"]

# Turn each padded sequence into a (1, seq_len) tensor, then stack them into a batch
tokens_tensors_list = [torch.tensor([indexed_tokens]) for indexed_tokens in indexed_tokens_list]
attention_mask_tensors_list = [torch.tensor([attention_mask]) for attention_mask in attention_mask_list]

tokens_tensors = torch.cat(tokens_tensors_list, 0)
attention_mask_tensors = torch.cat(attention_mask_tensors_list, 0)

# hidden_states has shape (batch_size, seq_len, 768)
outputs = model(tokens_tensors, attention_mask_tensors)
hidden_states = outputs[0]

# Average over the token dimension for each sentence
lines_vectorized = [hidden_states[i].data.numpy().mean(axis=0) for i in range(0, len(hidden_states))]

The problem is the following: I have to use padding so that I can appropriately concatenate the token tensors. That means that the indexed tokens and the attention masks can be longer than in the previous case, where the sentences were evaluated individually. But when I use padding, I get different results for the sentences that have been padded.

EXAMPLE: I have two sentences (in French but it doesn't matter):

sentence_A = "appareil digestif un article de wikipedia l encyclopedie libre"

sentence_B = "sauter a la navigation sauter a la recherche cet article est une ebauche concernant la biologie"

When I evaluate the two sentences individually, I obtain:

sentence_A:

indexed_tokens =  [10002, 3101, 4910, 557, 73, 3215, 9630, 2343, 4200, 8363, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.9304411   0.53798294 -1.6231083 ...]

sentence_B:

indexed_tokens =  [10002, 2217, 6496, 1387, 9876, 2217, 6496, 1387, 4441, 405, 73, 6451, 3, 2190, 5402, 1387, 2971, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.8077076   0.56028104 -1.5135447  ...]

But when I evaluate the two sentences in batch, I obtain:

sentence_A:

indexed_tokens =  [10002, 3101, 4910, 557, 73, 3215, 9630, 2343, 4200, 8363, 10000, 10004, 10004, 10004, 10004, 10004, 10004, 10004]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
line_vectorized =  [-1.0473819   0.6090186  -1.727466  ...]

sentence_B:

indexed_tokens =  [10002, 2217, 6496, 1387, 9876, 2217, 6496, 1387, 4441, 405, 73, 6451, 3, 2190, 5402, 1387, 2971, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.8077076   0.56028104 -1.5135447  ...]

That is, since sentence_B is longer than sentence_A, sentence_A has been padded and its attention mask has been padded with zeros as well. The indexed tokens now contain extra tokens (10004, which I assume is the padding token). The vector representation of sentence_B has NOT changed. But the vector representation of sentence_A HAS CHANGED.

I would like to know whether this is working as intended (I assume not). I guess I am doing something wrong, but I can't figure out what.

Any ideas?

Upvotes: 1

Views: 610

Answers (1)

Ashwin Geet D'Sa

Reputation: 7369

When you tokenize a single sentence at a time, the padded length is simply that sentence's own number of tokens. When you tokenize in a batch, however, the padded length is the same for every sentence in the batch and defaults to the number of tokens in the longest sentence. In the attention mask, a value of 1 indicates a real token and 0 indicates a <PAD> token. The best way to control this is to define a maximum sequence length and truncate the sentences that are longer than it.

This can be done using an alternative method that tokenizes the text in batches (a single sentence can be treated as a batch of size 1):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("<your bert model>", do_lower_case=True)
# Change max_length to the required maximum length
encoding = tokenizer.batch_encode_plus(lines, return_tensors='pt', padding=True, truncation=True, max_length=50, add_special_tokens=True)
indexed_tokens = encoding['input_ids']
attention_mask = encoding['attention_mask']
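
From there, the batch can be fed to the model the same way as before, and the per-sentence vectors obtained by averaging the last hidden state (a rough sketch, assuming model is the pre-trained BertModel from the question):

import torch

with torch.no_grad():
    outputs = model(indexed_tokens, attention_mask=attention_mask)

hidden_states = outputs[0]  # shape: (batch_size, max_length, 768)
lines_vectorized = [hidden_states[i].numpy().mean(axis=0) for i in range(hidden_states.shape[0])]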

Upvotes: 1
