sachinruk

Reputation: 9869

some parameters appear in more than one parameter group

When I try the below I get the error ValueError: some parameters appear in more than one parameter group. However, inspecting the model it is not clear to me what is the overlapping module.

The only possibility I can think of is that lm_head and transformer.wte both have parameters named weight. I'm wondering if this name is what is causing the error.

I am doing this so that the lower layers "move slowly" compared to the upper layers. I'm happy to hear about an alternative way to set these discriminative learning rates that avoids overlapping parameters (if there are any).

import torch
from transformers import AutoModelForCausalLM

language_model = AutoModelForCausalLM.from_pretrained("gpt2")
FREEZE_LAYERS = 2

caption_params = [
    {"params": language_model.lm_head.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.ln_f.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.h[FREEZE_LAYERS:].parameters(), "lr": 5e-5},
    {"params": language_model.transformer.wte.parameters(), "lr": 1e-5},
]
optimizer = torch.optim.Adam(caption_params)

Upvotes: 3

Views: 3506

Answers (1)

Luke G

Reputation: 176

The error message is diagnosing the problem correctly: there are some parameters that appear in more than one parameter group. You can prove this to yourself by doing the following:

>>> parameter_ids = [[id(p) for p in group["params"]] for group in caption_params]
>>> parameter_ids[0]
[140666221372896]
>>> parameter_ids[3]
[140666221372896]
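
More generally, you can scan all of the groups at once for any tensor that shows up more than once. A minimal sketch, rebuilding the groups as concrete lists (the .parameters() calls in caption_params return one-shot generators, so they can only be iterated a single time):

from collections import Counter

# Rebuild the groups as lists so they can be inspected and then still
# passed to the optimizer afterwards.
groups = [
    list(language_model.lm_head.parameters()),
    list(language_model.transformer.ln_f.parameters()),
    list(language_model.transformer.h[FREEZE_LAYERS:].parameters()),
    list(language_model.transformer.wte.parameters()),
]

# Count how many times each tensor id appears across groups; anything
# with a count greater than 1 is the culprit.
counts = Counter(id(p) for group in groups for p in group)
print([pid for pid, n in counts.items() if n > 1])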

This reveals that the first and last parameter groups, each of which contains a single large embedding tensor, are actually holding a reference to the same exact tensor. What is this tensor? Let's look at it, using both routes of reference to further show it's the same thing:

>>> a = next(language_model.lm_head.parameters())
>>> a
Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)
>>> b = next(language_model.transformer.wte.parameters())
>>> b
Parameter containing:
tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
        [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
        [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
        ...,
        [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
        [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
        [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
       requires_grad=True)
>>> a is b
True

This makes sense, because many Transformer-based models tie the weights used in mapping between word IDs and word representations at the beginning (the initial Embedding layer) and end (the LM head) of the model.
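
If you want to confirm the tying through the transformers API itself (a quick check, assuming a reasonably recent version of the library):

# tie_word_embeddings defaults to True for GPT-2's config, and the input
# and output embedding modules share one weight tensor.
print(language_model.config.tie_word_embeddings)   # should print True
print(language_model.get_input_embeddings().weight
      is language_model.get_output_embeddings().weight)  # should print True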

For your specific problem, you can either accept that the tied weights will move at the same learning rate, or untie them by cloning the parameter and assigning the copy to one of the two modules (see the sketch below).
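
Here is a minimal sketch of both options, assuming the language_model and FREEZE_LAYERS from the question. Since lm_head has no bias in GPT-2, its only parameter is the shared weight, so the first option simply gives that tensor its own group:

import torch
import torch.nn as nn

# Option 1: keep the tie, but put the shared tensor in exactly one group.
# Whatever LR you choose here applies to it both as the embedding and as
# the LM head.
shared_weight = language_model.transformer.wte.weight  # same tensor as lm_head.weight
caption_params = [
    {"params": [shared_weight], "lr": 1e-5},
    {"params": language_model.transformer.ln_f.parameters(), "lr": 1e-4},
    {"params": language_model.transformer.h[FREEZE_LAYERS:].parameters(), "lr": 5e-5},
]
optimizer = torch.optim.Adam(caption_params)

# Option 2: untie by giving lm_head its own copy of the weight. Note that
# transformers may re-tie the weights on save/load unless the config flag
# is turned off as well.
language_model.lm_head.weight = nn.Parameter(
    language_model.transformer.wte.weight.detach().clone()
)
language_model.config.tie_word_embeddings = False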

Upvotes: 4
