MAC

Reputation: 1533

How to change parameters of pre-trained longformer model from huggingface

I am using the Hugging Face pre-trained LongformerModel to extract sentence embeddings. I want to change the token length / maximum sentence length parameters, but I am not able to do so. Here is the code.

import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

model.eval()

text=[" I like to play cricket"]

input_ids = torch.tensor(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True)).unsqueeze(0)

print(tokenizer.encode(text,max_length=20,padding=True,add_special_tokens=True))

# [0, 38, 101, 7, 310, 5630, 2]

I expected the encoder to give me a list of size 20 with padding, since I passed max_length=20. But it returned a list of size 7 only.

attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)
attention_mask[:, [0,-1]] = 2  # 2 marks global attention on the first and last token (original Longformer convention)
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)

hidden_states = outputs[2]

print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")
        layer_i = 0

print ("Number of batches:", len(hidden_states[layer_i]))
        batch_i = 0

print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
        token_i = 0

print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Output:

Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 512 # How can I change this parameter to pick up my sentence length during run-time
Number of hidden units: 768

How can I reduce the number of tokens to the sentence length instead of 512? Every time I input a new sentence, it should pick up that length.

Upvotes: 1

Views: 1254

Answers (1)

cronoik

Reputation: 19510

Question regarding padding

padding=True pads your input to the longest sequence in the batch. padding='max_length' pads your input to the specified max_length (documentation):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
text=[" I like to play cricket"]
print(tokenizer.encode(text[0],max_length=20,padding='max_length',add_special_tokens=True))

Output:

[0, 38, 101, 7, 310, 5630, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
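If you also want the matching attention mask and PyTorch tensors in a single call, you can call the tokenizer directly instead of using encode. A minimal sketch (truncation=True is added here as an assumption, to guard against inputs longer than max_length):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# calling the tokenizer returns input_ids and attention_mask together, already padded to max_length
encoded = tokenizer(" I like to play cricket",
                    max_length=20,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt')

print(encoded['input_ids'].shape)       # torch.Size([1, 20])
print(encoded['attention_mask'].shape)  # torch.Size([1, 20])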

Question regarding the number of tokens of the hidden states

The Longformer implementation pads your sequence internally so that its length is a multiple of the attention window size. You can see the attention window sizes (one per layer) in your model config:

model.config.attention_window

Output:

[512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512]

This is the corresponding code line: link.
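Since those extra positions are only padding added to reach a multiple of the window size, one way to get per-token embeddings for just your real tokens is to slice the returned hidden states back to the original input length. A minimal sketch (assuming the hidden states come back padded to 512, as in the question's output; the slice is a no-op otherwise):

import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model.eval()

encoded = tokenizer(" I like to play cricket", return_tensors='pt')
seq_len = encoded['input_ids'].shape[1]  # 7 real tokens, including <s> and </s>

with torch.no_grad():
    outputs = model(**encoded, return_dict=True)

# slice away the internal padding so every tensor matches the input length
last_hidden = outputs.last_hidden_state[:, :seq_len, :]
all_layers = [h[:, :seq_len, :] for h in outputs.hidden_states]

print(last_hidden.shape)    # torch.Size([1, 7, 768])
print(all_layers[0].shape)  # torch.Size([1, 7, 768])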

Upvotes: 1
