Reputation: 2850
I am new to PyTorch and getting used to some concepts.
I need to train a neural network. For optimization, I need to use the Adam optimizer with 4 different learning rates: [2e-5, 3e-5, 4e-5, 5e-5].
The optimizer function is defined as below:
def optimizer(no_decay=['bias', 'gamma', 'beta'], lr=2e-5):
    param_optimizer = list(model.named_parameters())
    optimizer_grouped_parameters = [
        # parameters that get weight decay
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.01},
        # bias and normalization parameters, excluded from weight decay
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay_rate': 0.0}
    ]
    # This variable contains all of the hyperparameter information our training loop needs
    optimizer = BertAdam(optimizer_grouped_parameters, lr, warmup=.1)
    return optimizer
How do I make sure the optimizer uses my specified set of learning rates and returns the best model?
During training, we use the optimizer as below, and I don't see a way to tell it to try the different learning rates:
def model_train():
    # other code
    # clear out the gradients from the previous step
    optimizer.zero_grad()
    # forward pass
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    train_loss_set.append(loss.item())
    # backward pass
    loss.backward()
    # update the parameters and take a step using the computed gradients
    optimizer.step()
I know that optimizer.step() internally steps through and updates the parameters using the computed gradients. But how do I make sure the optimizer tries my specified set of learning rates and returns the best model to me?
Please suggest.
Upvotes: 1
Views: 3556
Reputation: 114786
If you want to train four times with four different learning rates and then compare, you need not only four optimizers but also four models: using a different learning rate (or any other meta-parameter, for that matter) yields a different trajectory of the weights in the high-dimensional "parameter space". That is, after a few steps it is not only the learning rate that differentiates between the models, but the trained weights themselves; this is what yields the actual difference between the models.
Therefore, you need to train 4 times using 4 separate instances of model with 4 instances of optimizer, each with a different learning rate.
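Here is a minimal sketch of such a sweep. The helpers are hypothetical: build_model() is assumed to return a fresh, untrained instance of your network, your optimizer() helper is assumed to be adapted to take the model as an argument instead of reading a global, model_train() is assumed to be your training loop adapted the same way, and evaluate() is assumed to return a validation score where higher is better:

learning_rates = [2e-5, 3e-5, 4e-5, 5e-5]

best_score, best_model, best_lr = float('-inf'), None, None
for lr in learning_rates:
    model = build_model()          # fresh weights for every run (hypothetical helper)
    opt = optimizer(model, lr=lr)  # one optimizer per model, adapted to take the model
    model_train(model, opt)        # train this (model, lr) pair from scratch
    score = evaluate(model)        # validation metric, e.g. accuracy (hypothetical helper)
    if score > best_score:
        best_score, best_model, best_lr = score, model, lr

print(f"best lr: {best_lr}, validation score: {best_score}")

Comparing the runs on a held-out validation score, rather than the training loss, is what makes "the best model" a meaningful choice between the four trajectories.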
Upvotes: 1