nad

Reputation: 2850

Training a model with multiple learning rates in PyTorch

I am new to PyTorch and getting used to some concepts.

I need to train a neural network. For optimization, I need to use the Adam optimizer with 4 different learning rates: [2e-5, 3e-5, 4e-5, 5e-5]

The optimizer function is defined as follows:

from pytorch_pretrained_bert import BertAdam

def optimizer(no_decay=['bias', 'gamma', 'beta'], lr=2e-5):
    param_optimizer = list(model.named_parameters())
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
    # This variable contains all of the hyperparameter information our training loop needs
    optimizer = BertAdam(optimizer_grouped_parameters, lr=lr, warmup=.1)
    return optimizer

How do I make sure the optimizer uses my specified set of learning rates and returns the best model?

During training, we use the optimizer as below, where I don't see a way to tell it to try the different learning rates:

def model_train():
    # other code
    # clear out the gradient
    optimizer.zero_grad()
    # Forward pass
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    train_loss_set.append(loss.item())
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()

I know that optimizer.step() internally updates the parameters using the computed gradients. But how do I make sure the optimizer tries my specified set of learning rates and returns the best model to me?

Please suggest.

Upvotes: 1

Views: 3556

Answers (1)

Shai

Reputation: 114786

If you want to train four times with four different learning rates and then compare the results, you need not only four optimizers but also four models: using a different learning rate (or any other meta-parameter, for that matter) yields a different trajectory of the weights in the high-dimensional "parameter space". That is, after a few steps it is not only the learning rate that differentiates the models, but the trained weights themselves; this is what yields the actual difference between the models.

Therefore, you need to train four times, using four separate instances of the model and four instances of the optimizer, each with a different learning rate.
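A minimal sketch of such a search loop, reusing the optimizer() function from your question. Here build_model() and evaluate() are hypothetical helpers (a factory returning a freshly initialised model, and a function returning a validation loss), and model_train is assumed to be adapted to take the model and optimizer as arguments instead of using globals:

import copy

learning_rates = [2e-5, 3e-5, 4e-5, 5e-5]
best_val_loss = float('inf')
best_model = None

for lr in learning_rates:
    # Start every run from a freshly initialised model, so the runs
    # differ only in their learning rate.
    model = build_model()          # hypothetical factory returning a new model
    opt = optimizer(lr=lr)         # the optimizer() function from the question
    model_train(model, opt)        # the training loop, adapted to take arguments
    val_loss = evaluate(model)     # hypothetical validation-loss helper

    # Keep the model that reaches the lowest validation loss.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

The copy.deepcopy is there so the stored best model is not overwritten by later runs; saving each run's state_dict to disk and reloading the best one afterwards would work just as well.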

Upvotes: 1
