Reputation: 389
I've found something quite strange when using Huggingface Transformers with a custom training loop in PyTorch.
But first, some context: I'm currently trying to fine-tune a pretrained GPT2 small (GPT2LMHeadModel; the ~170M param version) on multiple nodes using Huggingface Accelerate. I'm using Huggingface's datasets library for training.
Of course, the first step in this process with Accelerate is to write a custom PyTorch training loop, which I did with the help of the official Huggingface tutorial (a rough sketch of how I plan to wrap the loop with Accelerate follows the code below). Naturally, I decided to test the model with this new training loop before actually adding Accelerate, to make sure it worked.
Here's the relevant code from my original (Trainer-based) script, as well as the corresponding code from the new training loop:
Note: BATCH_SIZE is equal to 2 in both versions. All code not shown is exactly the same between the two.
Original:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
train_args = TrainingArguments(
    output_dir=OUTPUT_DIRECTORY,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=BATCH_SIZE,
    save_steps=10_000,
    save_total_limit=1,  # How many "checkpoints" to save at a time
    prediction_loss_only=True,
    remove_unused_columns=False,
    optim="adamw_torch"
)
trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=dc,
    train_dataset=data
)
trainer.train()
Custom Train Loop:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dl = DataLoader(
    data, shuffle=True, batch_size=BATCH_SIZE, collate_fn=dc
)
epochs = 1
training_steps = epochs * len(train_dl)
scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=training_steps
)
progress_bar = tqdm(range(training_steps))
model.train()
for epoch in range(epochs):
    for batch in train_dl:
        # Run a batch through the model
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
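For context, this is roughly how I plan to wrap the loop with Accelerate afterwards (an untested sketch based on the Accelerate docs; the comparison below is for the plain loop above, not this version):

from accelerate import Accelerator

accelerator = Accelerator()
# Accelerate handles device placement and multi-GPU/multi-node setup
model, optimizer, train_dl, scheduler = accelerator.prepare(
    model, optimizer, train_dl, scheduler
)

model.train()
for epoch in range(epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()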
I tested the custom loop (with one node, of course, and without Accelerate yet) on two 16 GB GPUs. And it worked... but suspiciously well.
This is absolutely bizarre. How is it possible that simply writing my own training loop, which is just a few lines of code, is not only faster than the official one provided by Huggingface, but nearly TWICE as fast? Did I write the training loop incorrectly? Am I completely missing something here?
Upvotes: 1
Views: 1673
Reputation: 15845
In your training loop, you call optimizer.step()
directly after computing the loss, with no gradient accumulation.
The default Trainer uses gradient accumulation (1 gradient accumulation step by default). This causes gradients to be accumulated over multiple batches before the model weights are updated; it is useful for improving accuracy, but it slows down the training procedure.
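For reference, here is a minimal sketch of what accumulation would look like if you added it to your loop (accum_steps is a made-up value for illustration; the Trainer exposes the same setting as gradient_accumulation_steps in TrainingArguments):

accum_steps = 8  # hypothetical value; the Trainer's default is 1

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_dl):
        outputs = model(**batch)
        # Scale the loss so the summed gradients match one large-batch update
        loss = outputs.loss / accum_steps
        loss.backward()
        # Only step the optimizer every accum_steps batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()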
Upvotes: 2