GZinn

Reputation: 11

PyTorch Lightning's ReduceLROnPlateau not working properly

I have been trying to write a Lightning module that uses both a warmup and an annealing function (ReduceLROnPlateau), and something really odd is happening. Whenever ReduceLROnPlateau reduces the learning rate, the learning rate simply pops back up to the base value on the next step. For example, if the base learning rate is 1e-3 and ReduceLROnPlateau fires on step 145, my module resets the learning rate to 1e-3 on step 146. It is hard to see in the image provided because of TensorBoard's smoothing factor.

I can't figure out why this is happening and I haven't found anything online about it. I am assuming this means I am making a very silly mistake. Does anyone have any ideas? I have attached my optimizer code below for reference.

    import torch
    from torch.optim.lr_scheduler import LambdaLR, ReduceLROnPlateau

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.params['training']['learning_rate'])

        # Linear warmup: scale the lr from 0 up to the base value over warmup_steps
        warmup_steps = self.params['training']['warmup_epochs'] * self.params['training']['train_steps_per_epoch']
        lambda_warmup = lambda step: min(step / warmup_steps, 1)

        # Stepped every training step
        warmup_scheduler = {
            'scheduler': LambdaLR(optimizer, lr_lambda=lambda_warmup),
            'interval': 'step',
            'name': 'warmup',
        }

        # Cuts the lr by 10x when the monitored val_loss stops improving
        reduce_on_plateau_scheduler = {
            'scheduler': ReduceLROnPlateau(
                optimizer,
                mode='min',
                factor=0.1,
                patience=10,
                verbose=True
            ),
            'monitor': 'val_loss'
        }

        return [optimizer], [warmup_scheduler, reduce_on_plateau_scheduler]

As a note, I slightly changed my settings to get the screenshot of my learning rate during my validation step: I reduced the patience and additionally used loss = loss / loss to ensure that the val_loss was always 1 (i.e., on a plateau). This was to speed up the process and force ReduceLROnPlateau to fire, but the same issue also happens on normal, longer training runs. I am also confident that ReduceLROnPlateau is firing, as it says so in the terminal.
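
For reference, the tweak in my validation step looked roughly like this (the batch unpacking and MSE loss here are just placeholders, not my actual code):

    def validation_step(self, batch, batch_idx):
        x, y = batch                                      # placeholder batch structure
        loss = torch.nn.functional.mse_loss(self(x), y)   # placeholder loss
        loss = loss / loss                                # pin val_loss at exactly 1
        self.log('val_loss', loss)                        # this is what ReduceLROnPlateau monitors
        return loss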

I think that my issue has something to do with my warmup setup. My assumption is that the warmup scheduler keeps forcing the learning rate back up to my base learning rate (see the sketch below). Is there an easy way to fix this?
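
For what it's worth, here is a minimal bare-PyTorch sketch (outside Lightning, with made-up warmup and patience numbers) of the interaction I suspect: as far as I can tell, LambdaLR recomputes the lr as base_lr * lambda(step) from the optimizer's initial learning rate on every step, so it overwrites whatever ReduceLROnPlateau set on the previous step:

    import torch
    from torch.optim.lr_scheduler import LambdaLR, ReduceLROnPlateau

    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.Adam([param], lr=1e-3)

    warmup = LambdaLR(optimizer, lr_lambda=lambda step: min(step / 10, 1))      # 10-step warmup
    plateau = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=0)

    for step in range(20):
        optimizer.step()
        warmup.step()        # sets lr = 1e-3 * min(step / 10, 1) from the initial lr
        plateau.step(1.0)    # constant "val_loss" -> keeps cutting the lr by 10x
        print(step, optimizer.param_groups[0]['lr'])

    # After warmup finishes, every reduction made by ReduceLROnPlateau is undone by
    # the next warmup.step(), so the printed lr keeps snapping back to 1e-3.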

Thank you so much in advance!

Upvotes: 1

Views: 667

Answers (0)
