Reputation: 1533
I am using the Hugging Face pre-trained LongformerModel to extract sentence embeddings. I want to change the token length / max sentence length parameter, but I am not able to do so. Here is the code.
import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model.eval()
text=[" I like to play cricket"]
input_ids = torch.tensor(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True)).unsqueeze(0)
print(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True))
# [0, 38, 101, 7, 310, 5630, 2]
I expected the encoder to give me a list of size 20 with padding, since I passed max_length=20, but it returned a list of only 7 elements.
# Local attention everywhere; a value of 2 marks global attention on the first and last token
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [0, -1]] = 2
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
hidden_states = outputs[2]  # tuple of hidden states: embeddings + one entry per layer
print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))
Output:
Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 512 # How can I change this parameter to pick up my sentence length during run-time
Number of hidden units: 768
How can I reduce the number of tokens to the sentence length instead of 512? Every time I input a new sentence, it should pick up that length.
Upvotes: 1
Views: 1254
Reputation: 19510
Question regarding padding
padding=True pads your input to the longest sequence in the batch, while padding='max_length' pads your input to the specified max_length (documentation):
from transformers import LongformerTokenizer
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
text=[" I like to play cricket"]
print(tokenizer.encode(text[0],max_length=20,padding='max_length',add_special_tokens=True))
Output:
[0, 38, 101, 7, 310, 5630, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
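For comparison, here is a small sketch (my addition, not from the original answer) of what padding=True does when you tokenize a batch: every sequence is only padded up to the longest sequence in that batch, not up to max_length.
from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
batch = [" I like to play cricket", " I like to play cricket and football with my friends"]

# padding=True pads each sequence to the length of the longest one in the batch
encoded = tokenizer(batch, padding=True, add_special_tokens=True)
print([len(ids) for ids in encoded['input_ids']])
# Both lengths equal that of the longer sentence, not max_length.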
Question regarding the number of tokens of the hidden states
The Longformer implementation applies padding to your sequence to match the attention window sizes. You can see the size of the attention windows in your model config:
model.config.attention_window
Output:
[512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512]
This is the corresponding code line: link.
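If you only want the embeddings of your actual tokens, one option (a minimal sketch, not part of the original answer; it reuses input_ids, model and outputs from the question's snippet and assumes the window padding is appended at the end, which newer transformers versions may already strip from the output) is to slice the hidden states back to the real sequence length:
import math

# Longformer pads the input up to a multiple of the largest attention window,
# which is why a 7-token sentence yields 512 hidden-state positions.
attention_window = max(model.config.attention_window)                  # 512
seq_len = input_ids.shape[1]                                           # 7
padded_len = math.ceil(seq_len / attention_window) * attention_window  # 512

# Keep only the positions that correspond to the real tokens.
last_layer = outputs[2][-1]                       # shape: (1, 512, 768)
sentence_embeddings = last_layer[:, :seq_len, :]
print(sentence_embeddings.shape)                  # torch.Size([1, 7, 768])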
Upvotes: 1