Reputation: 389
I've found something quite strange when using Huggingface Transformers with a custom training loop in PyTorch.
But first, some context: I'm currently trying to fine-tune a pretrained GPT2 small (GPT2LMHeadModel; the ~170M param version) on multiple nodes using Huggingface Accelerate. I'm using Huggingface's datasets library for training.
Of course, the first step in this process with Accelerate is to write a custom PyTorch training loop, which I did with the help of the official Huggingface tutorial (a rough sketch of how I plan to wrap the loop with Accelerate follows the code below). Naturally, I decided to test the model with this new training loop before actually adding Accelerate, to make sure it worked.
Here's the relevant code from my original (Trainer-based) script, as well as the corresponding code from the new training loop:
Note: BATCH_SIZE is equal to 2 in both versions. All code not shown is exactly the same between the two.
Original:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
train_args = TrainingArguments(
    output_dir=OUTPUT_DIRECTORY,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=BATCH_SIZE,
    save_steps=10_000,
    save_total_limit=1,  # How many "checkpoints" to save at a time
    prediction_loss_only=True,
    remove_unused_columns=False,
    optim="adamw_torch"
)
trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=dc,
    train_dataset=data
)
trainer.train()
Custom Train Loop:
data = data['train']
dc = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = AdamW(model.parameters(), lr=5e-5)
train_dl = DataLoader(
    data, shuffle=True, batch_size=BATCH_SIZE, collate_fn=dc
)
epochs = 1
training_steps = epochs * len(train_dl)
scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=training_steps
)
progress_bar = tqdm(range(training_steps))
model.train()
for epoch in range(epochs):
    for batch in train_dl:
        # Run a batch through the model
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
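For context, this is roughly how I plan to wrap the loop with Accelerate afterwards (an untested sketch based on the Accelerate docs; the comparison below is for the plain loop above, not this version):

from accelerate import Accelerator

accelerator = Accelerator()
# Accelerate handles device placement and multi-GPU/multi-node setup
model, optimizer, train_dl, scheduler = accelerator.prepare(
    model, optimizer, train_dl, scheduler
)

model.train()
for epoch in range(epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()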
I tested the custom loop (with one node, of course, and without Accelerate yet) on two 16 GB GPUs. And it worked... but suspiciously well.
This is absolutely bizarre. How is it possible that simply writing my own training loop, which is just a few lines of code, is not only faster than the official one provided by Huggingface, but nearly TWICE as fast? Did I write the training loop incorrectly? Am I completely missing something here?
Upvotes: 1
Views: 1673
Reputation: 15845
In your training loop, you call optimizer.step()
directly after computing the loss, with no gradient accumulation.
The default Trainer uses gradient accumulation (1 gradient accumulation step by default). This causes gradients to be accumulated over multiple batches before the model weights are updated; it is useful for improving accuracy, but it slows down the training procedure.
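For reference, here is a minimal sketch of what accumulation would look like if you added it to your loop (accum_steps is a made-up value for illustration; the Trainer exposes the same setting as gradient_accumulation_steps in TrainingArguments):

accum_steps = 8  # hypothetical value; the Trainer's default is 1

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_dl):
        outputs = model(**batch)
        # Scale the loss so the summed gradients match one large-batch update
        loss = outputs.loss / accum_steps
        loss.backward()
        # Only step the optimizer every accum_steps batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()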
Upvotes: 2