Reputation: 87
I am trying to implement a classification head for the Reformer transformer. The classification head works fine, but when I try to change one of the config parameters, config.axial_pos_shape (i.e. the sequence-length parameter of the model), it throws an error:
size mismatch for reformer.embeddings.position_embeddings.weights.0: copying a param with shape torch.Size([512, 1, 64]) from checkpoint, the shape in current model is torch.Size([64, 1, 64]).
size mismatch for reformer.embeddings.position_embeddings.weights.1: copying a param with shape torch.Size([1, 1024, 192]) from checkpoint, the shape in current model is torch.Size([1, 128, 192]).
The config:
{
  "architectures": [
    "ReformerForSequenceClassification"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    64,
    256
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 8192,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}
Python Code:
import torch
from transformers import ReformerConfig, ReformerForSequenceClassification

config = ReformerConfig()
config.max_position_embeddings = 8192
config.axial_pos_shape = [64, 128]
# config = ReformerConfig.from_pretrained('./cnp/config.json', output_attention=True)
model = ReformerForSequenceClassification(config)
model.load_state_dict(torch.load("./cnp/pytorch_model.bin"))
Upvotes: 3
Views: 6962
Reputation: 41
I ran into the same issue while trying to halve the default maximum sequence length of 65536 (128 * 512) used in Reformer pre-training.
As @cronoik mentioned, you must load the pretrained model, resize it to your needs by dropping the unnecessary weights, save the resized model, and then load that resized model for your task.
Those unnecessary weights are the ones from the position embeddings layer. The Reformer model uses the Axial Position Encodings strategy to learn its position embeddings (rather than having fixed ones like BERT). Axial Position Encodings store the position embeddings in a memory-efficient manner, using two small tensors rather than one big one.
However, the idea of position embeddings remains exactly the same: obtaining a different embedding for each position.
That said, in theory (correct me if I am misunderstanding something), removing the last position embeddings to match your custom max sequence length should not hurt performance. You can refer to this post from HuggingFace for a more detailed description of Axial Position Encodings and to understand where to truncate your position embeddings tensor.
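To make the factorization concrete, here is a minimal sketch of how the two small tensors combine into one embedding per position, assuming the shapes from the question's config (axial_pos_shape = [64, 256], axial_pos_embds_dim = [64, 192], hidden_size = 256); the tensors below are random stand-ins, not the model's actual parameters:
import torch

# Two small axial factors instead of one big (16384, 256) position table
w0 = torch.randn(64, 1, 64)    # like weights[0]: varies along the first grid axis
w1 = torch.randn(1, 256, 192)  # like weights[1]: varies along the second grid axis

# Broadcast both factors over the full 64 x 256 position grid and concatenate
# them on the feature dimension, giving one 256-dim embedding per position
grid = torch.cat([w0.expand(64, 256, 64), w1.expand(64, 256, 192)], dim=-1)
flat = grid.reshape(64 * 256, 256)  # 16384 positions, each with a distinct embedding
print(flat.shape)                   # torch.Size([16384, 256])
Truncating weights[1] along its position axis (as done in the code below) simply shrinks the second axis of this grid, and with it the maximum sequence length.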
I have managed to resize and use Reformer with a custom max length of 32768 (128*256) with the following code:
import torch
from transformers import ReformerForSequenceClassification

# Load initial pretrained model
model = ReformerForSequenceClassification.from_pretrained('google/reformer-enwik8', num_labels=2)
# Reshape Axial Position Embeddings layer to match desired max seq length
# (keep the first 256 positions of the second factor; the leading dim of 1 is
# preserved so the saved weight matches the shape expected for axial_pos_shape = (128, 256))
model.reformer.embeddings.position_embeddings.weights[1] = torch.nn.Parameter(
    model.reformer.embeddings.position_embeddings.weights[1][:, :256])
# Update the config file to match custom max seq length
model.config.axial_pos_shape = 128, 256
model.config.max_position_embeddings = 128*256 # 32768
# Save model with custom max length
output_model_path = "path/to/model"
model.save_pretrained(output_model_path)
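After saving, the resized checkpoint can be reloaded like any other local pretrained model; a minimal sketch, reusing the placeholder path from above:
# Reload the resized model; the saved config now matches the truncated position embeddings
model = ReformerForSequenceClassification.from_pretrained(output_model_path)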
Upvotes: 2