Tobias Uhmann

Reputation: 3037

How to get step-wise validation loss curve over all epochs in PyTorch Lightning

When logging my validation loss inside validation_step() in PyTorch Lightning like this:

def validation_step(self, batch: Tuple[Tensor, Tensor], _batch_index: int) -> None:
    inputs_batch, labels_batch = batch

    outputs_batch = self(inputs_batch)
    loss = self.criterion(outputs_batch, labels_batch)

    # in validation_step, self.log aggregates and logs once per epoch by default
    self.log('loss (valid)', loss.item())

I get an epoch-wise loss curve:

[TensorBoard screenshot: a single epoch-wise validation loss curve]

If I want the step-wise loss curve I can set on_step=True:

def validation_step(self, batch: Tuple[Tensor, Tensor], _batch_index: int) -> None:
    inputs_batch, labels_batch = batch

    outputs_batch = self(inputs_batch)
    loss = self.criterion(outputs_batch, labels_batch)

    # on_step=True logs every validation batch individually
    self.log('loss', loss.item(), on_step=True)

This results in step-wise loss curves for each epoch:

[TensorBoard screenshot: separate step-wise loss curves, one per epoch]

How can I get a single graph spanning all epochs instead? When training runs for thousands of epochs, the separate curves get messy.

Upvotes: 2

Views: 5414

Answers (1)

Fredrik

Reputation: 497

It seems that something went wrong when initializing your logger. Is it defined like the following:

logger = TensorBoardLogger("tb_logs", name="my_model")
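
For completeness, a minimal sketch of wiring that logger into the Trainer (assuming the standard pytorch_lightning imports; "tb_logs" and "my_model" are just the placeholder names from above):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# create the TensorBoard logger and hand it to the Trainer
logger = TensorBoardLogger("tb_logs", name="my_model")
trainer = Trainer(logger=logger)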

Note that on_step modifies your tag, which is one reason the curves show up as separate plots.

Instead of using on_step, you can call TensorBoard's add_scalar directly:

self.logger.experiment.add_scalar('name', metric)
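
For example, calling add_scalar from validation_step with a running counter as the x value gives one continuous curve across all epochs. A minimal sketch, assuming a counter attribute initialized in __init__ (the name valid_step_count is made up here) and the criterion from your question:

def __init__(self):
    super().__init__()
    # running index used as the x value, so all epochs share one axis
    self.valid_step_count = 0

def validation_step(self, batch, _batch_index):
    inputs_batch, labels_batch = batch

    outputs_batch = self(inputs_batch)
    loss = self.criterion(outputs_batch, labels_batch)

    # write straight to TensorBoard under a single, unmodified tag
    self.logger.experiment.add_scalar('loss (valid)', loss.item(), self.valid_step_count)
    self.valid_step_count += 1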

If you want the plot's x axis to show the number of epochs instead of steps, you can place the logging call inside validation_epoch_end(self, outputs):

def validation_epoch_end(self, outputs):
    # outputs is the list of values returned by validation_step over the epoch
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    self.logger.experiment.add_scalar('loss', avg_loss, self.current_epoch)
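
For the "val_loss" key to be present in outputs, validation_step has to return the loss under that name, e.g.:

def validation_step(self, batch, _batch_index):
    inputs_batch, labels_batch = batch

    outputs_batch = self(inputs_batch)
    loss = self.criterion(outputs_batch, labels_batch)

    return {"val_loss": loss}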

Upvotes: 1
