Reputation: 1
I'm using a TensorBoard logger via pytorch-lightning like this:
tb_logger = pl_loggers.TensorBoardLogger(args.dir)
trainer = pl.Trainer(...., logger=tb_logger, check_val_every_n_epoch=20, log_every_n_steps=500, num_sanity_val_steps=0)
In my training_epoch_end and validation_epoch_end hooks I log everything like this:
tensorboard = self.logger.experiment
tensorboard.add_scalar('Acc', acc, self.current_epoch)
However, when I monitor the training run, TensorBoard actually shows me two runs. One is called default/version_0 and has all my scalars, histograms, etc., which is exactly what I want and as intended. The other run, called GLOBAL, logs a scalar called nll_loss_output_0.
I'm merely calling
loss = torch.nn.functional.cross_entropy(logits, targets)
and don't understand where this second run comes from.
My local TensorBoard tells me there are too many files in the GLOBAL folder (2800+), and on SageMaker with TensorBoard monitoring I get an InternalServerError and the whole training run fails a third of the way in.
In training_step, I am returning:
log = {
    'train_loss': loss.detach(),
    'acc1': acc1,
    'acc10': acc10
}
return {'loss': loss, 'data': log}
But I'm not calling anything "nll_loss_output_0" anywhere... could someone please advise on how to get rid of the GLOBAL run altogether? Setting log_every_n_steps to 2000000 or so might not fix it, as I'm getting multiple files per logged training_step: (screenshot of GLOBAL folder content)
Or is there any implicit logging for learning rate schedulers?
def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=self.lr)
    lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, self.trainer.max_epochs)
    self.optimizer_ = optimizer
    return [optimizer], [lr_schedule]
I'm also using a checkpoint callback:
checkpoint_callback = ModelCheckpoint(
    save_top_k=-1,
    dirpath=os.path.join(args.output_dir, 'checkpoints/'),
    filename='checkpoint{epoch:04d}',
    auto_insert_metric_name=False,
    every_n_epochs=20,
    save_on_train_epoch_end=True)
Would be great if someone could help me out!
All the best,
Jonas
Upvotes: 0
Views: 601
Reputation: 144
I don't know why the GLOBAL files are created, but I wanted to give you some general tips.
First, there is a logger option in the self.log() method, which means you don't need to explicitly use the TensorBoard logger to log metrics. (docs)
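For example, a minimal sketch of self.log() inside a training_step (the metric name 'train_loss' and the forward call are illustrative, not from the question):
def training_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)  # assumes the module's forward returns logits
    loss = torch.nn.functional.cross_entropy(logits, y)
    # logger=True routes the value to the attached logger (TensorBoard here);
    # prog_bar=True also shows it in the progress bar
    self.log('train_loss', loss, logger=True, prog_bar=True)
    return loss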
Second, there is pytorch_lightning.callbacks.LearningRateMonitor, which handles logging the learning rate to the logger. (docs)
The snippet below writes the learning rate, the train/valid loss, and the train/valid top-1 and top-5 accuracies to the TensorBoard logger.
import os

import torch
from torch import nn
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.datasets import CIFAR10
from pytorch_lightning import LightningModule, LightningDataModule, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor
from torchmetrics import Accuracy, MetricCollection

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'


class CIFAR(LightningDataModule):
    def __init__(self, img_size=32, batch_size=32):
        super().__init__()
        self.img_size = img_size if isinstance(img_size, tuple) else (img_size, img_size)
        self.batch_size = batch_size
        self.train_transforms = transforms.Compose([
            transforms.Resize(self.img_size),
            transforms.Pad(4, padding_mode='reflect'),
            transforms.RandomCrop(self.img_size),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
        ])
        self.test_transforms = transforms.Compose([
            transforms.Resize(self.img_size),
            transforms.CenterCrop(self.img_size),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
        ])

    def prepare_data(self) -> None:
        CIFAR10(root='data', train=True, download=True)
        CIFAR10(root='data', train=False, download=True)

    def setup(self, stage=None):
        self.train_ds = CIFAR10(root='data', train=True, download=False, transform=self.train_transforms)
        self.valid_ds = CIFAR10(root='data', train=False, download=False, transform=self.test_transforms)

    def train_dataloader(self):
        return DataLoader(self.train_ds, num_workers=4, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valid_ds, num_workers=4, batch_size=self.batch_size, shuffle=False)


class BasicModule(LightningModule):
    def __init__(self, lr=0.01):
        super().__init__()
        self.model = models.resnet18(pretrained=False)
        self.criterion = nn.CrossEntropyLoss()
        self.lr = lr
        # clone() with a prefix yields separate 'train/...' and 'valid/...' tags in TensorBoard
        metric = MetricCollection({'top@1': Accuracy(top_k=1), 'top@5': Accuracy(top_k=5)})
        self.train_metric = metric.clone(prefix='train/')
        self.valid_metric = metric.clone(prefix='valid/')

    def training_step(self, batch, batch_idx, optimizer_idx=None):
        return self.shared_step(*batch, self.train_metric)

    def validation_step(self, batch, batch_idx):
        return self.shared_step(*batch, self.valid_metric)

    def shared_step(self, x, y, metric):
        y_hat = self.model(x)
        loss = self.criterion(y_hat, y)
        # logger=True sends the metrics to the attached (TensorBoard) logger
        self.log_dict(metric(y_hat, y), logger=True, prog_bar=True)
        # also log the loss under the same 'train/' or 'valid/' prefix
        self.log(f'{metric.prefix}loss', loss, logger=True)
        return loss

    def configure_optimizers(self):
        return SGD(self.model.parameters(), lr=self.lr)


if __name__ == '__main__':
    data = CIFAR(batch_size=512)
    model = BasicModule(lr=0.01)
    callbacks = [LearningRateMonitor()]  # logs the learning rate automatically
    trainer = Trainer(max_epochs=2, gpus='0,1', accelerator='gpu', strategy='ddp', precision=16, callbacks=callbacks)
    trainer.fit(model, data)
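The MetricCollection.clone(prefix=...) pattern is what lets a single shared_step serve both training and validation: the same metrics and loss land under separate train/ and valid/ tags in TensorBoard without any manual add_scalar calls.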
Upvotes: 0