Reputation: 904
HuggingFace's get_linear_schedule_with_warmup takes as arguments:

get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1)
And in the guide on writing a full training loop, which uses a similar scheduler, they state:
To properly define [the scheduler], we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader).
I want to follow an implementation from a research paper in which they apply linear learning rate warm-up during the first 10% of the updates, followed by a linear decay.
I was a bit confused by the wording "first 10% of the updates": would this correspond to 10% of the entire training run? Am I right in assuming that, since num_training_steps is based on the number of epochs multiplied by the number of batches, num_warmup_steps = number of batches * number of epochs * 0.1?
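Concretely, here is a sketch of what I have in mind (model and train_dataloader are assumed to already exist; the learning rate and epoch count are just placeholders):

import torch
from transformers import get_linear_schedule_with_warmup

num_epochs = 3  # placeholder
num_training_steps = num_epochs * len(train_dataloader)  # epochs * batches per epoch
num_warmup_steps = int(0.1 * num_training_steps)         # warm up over the first 10% of updates

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # placeholder lr
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)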
Upvotes: 3
Views: 4326
Reputation: 19495
I see it the same way as you: the "first 10% of the updates" refers to 10% of the total number of training steps.
Commonly, a formula like the following is used to compute the total number of training steps (with gradient accumulation, one parameter update happens every gradient_accumulation_steps batches, so the dataloader length is divided by that factor):

t_total = len(train_dataloader) // gradient_accumulation_steps * num_of_epochs
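As a rough sketch of how this fits together (model, train_dataloader, and the hyperparameters below are placeholders), note that scheduler.step() is called once per optimizer update, not once per batch, which is why t_total counts updates rather than batches:

import torch
from transformers import get_linear_schedule_with_warmup

num_of_epochs = 3               # placeholder
gradient_accumulation_steps = 2  # placeholder

t_total = len(train_dataloader) // gradient_accumulation_steps * num_of_epochs
num_warmup_steps = int(0.1 * t_total)  # first 10% of the updates

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, t_total)

model.train()
for epoch in range(num_of_epochs):
    for step, batch in enumerate(train_dataloader):
        loss = model(**batch).loss / gradient_accumulation_steps
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()  # advance the schedule once per optimizer update
            optimizer.zero_grad()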
Upvotes: 1