I'm tuning a neural network to predict growth within a dataset, but I'm running into two issues in my experiments: 1) running the same experiment multiple times yields loss values that sometimes vary by as much as 0.5, and 2) the test loss shows an initial increase before it starts to decrease, and this happens consistently across runs.
I have been trying to make my experiments reproducible, but the results still vary by a substantial margin (test loss can land anywhere between 0.8 and 1.7 for the same experiment). I have tried to reduce randomness in several ways, including seeding, passing a generator to the dataset shuffling, and turning off AMP, but none of these have helped so far. The datasets are always the same. The only other thought I had is that it could be something to do with the random weight initialisation.
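For reference, this is roughly the kind of seeding I mean; the helper name, the seed value, and the commented-out DataLoader line are illustrative placeholders rather than my exact code:

import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    # Seed the Python, NumPy and PyTorch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable auto-tuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)

# The DataLoader shuffling is driven by an explicitly seeded generator
g = torch.Generator()
g.manual_seed(42)
# train_loader = DataLoader(train_dataset, batch_size=..., shuffle=True, generator=g)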
This looks like an unusual test loss curve to me: as I understand it, the test loss should drop and produce a curve similar to the training loss curve. As you can see, sometimes the test loss rises so high that early stopping terminates the run before the loss is able to improve at all. The plots aren't always indicative of the final best saved checkpoint; the best model is saved from every run.
I would love some thoughts on why the test loss could have such a pattern. It doesn't seem like straightforward overfitting, because the test loss does come back down, and most of the time it even drops further than before.
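One workaround I have been considering for the early-stopping side of this is to only let the patience counter run after a warm-up period, so the early bump in test loss can't kill the run before it has a chance to recover. min_epochs below is a made-up name, not something already in my script; the other variables are from the training loop further down:

min_epochs = 30  # hypothetical warm-up before early stopping may trigger

if test_loss < best_loss:
    best_loss = test_loss
    counter = 0
elif epoch >= min_epochs:  # only count non-improving epochs after the warm-up
    counter += 1
    if counter >= patience:
        early_stop = True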
*AMP and the LR scheduler (ReduceLROnPlateau) are currently disabled
import torch
import torch.nn as nn
import torch.optim as optim
from torch import amp
from torch.optim.lr_scheduler import ReduceLROnPlateau
import wandb

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = bruteForceNN().to(device)
criterion = nn.MSELoss()
optimiser = optim.Adam(model.parameters(), lr=0.0001)

start_epoch = 0
checkpoint = {}  # Initialise checkpoint as an empty dictionary
# Training loop
epochs = 200
l1_lambda = 1e-4 # Regularisation strength
# Initialise early stopping parameters
patience = 20
best_loss = float('inf')
best_epoch = 0
counter = 0
early_stop = False
print("Starting Training Loop")
scheduler = ReduceLROnPlateau(optimiser, mode='min', factor=0.5, patience=10)
# Check if CUDA is available
use_cuda = torch.cuda.is_available()
for epoch in range(start_epoch, epochs):
    if early_stop:
        print(f"Early stopped at epoch {epoch}")
        break

    model.train()
    epoch_loss = 0
    print(f"============ Epoch {epoch} ==========")

    for batch_X, batch_time, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_time = batch_time.to(device)
        batch_y = batch_y.to(device)

        optimiser.zero_grad()  # Gradients from previous batch are zeroed

        # Forward pass (autocast kept in place but currently disabled, i.e. full precision)
        with amp.autocast(device_type='cuda' if use_cuda else 'cpu', enabled=False):
            predictions = model(batch_X, batch_time)  # Predictions are generated

            # Data loss
            data_loss = criterion(predictions, batch_y)

            # L1 regularisation loss
            l1_loss = sum(torch.sum(torch.abs(param)) for param in model.parameters())
            loss = data_loss + l1_lambda * l1_loss

        # Backpropagation
        loss.backward(retain_graph=True)
        optimiser.step()  # Model parameters are updated

        epoch_loss += loss.item()

    epoch_loss /= len(train_loader)
    current_lr = optimiser.param_groups[0]['lr']
    wandb.log({"epoch": epoch + 1, "learning_rate": current_lr, "train_loss": epoch_loss})

    # Evaluate model
    model.eval()
    with torch.no_grad():
        X_test_tensor = X_test_tensor.to(device)
        y_test_tensor = y_test_tensor.to(device)
        time_test_tensor = time_test_tensor.to(device)

        with amp.autocast(device_type='cuda' if use_cuda else 'cpu', enabled=False):
            test_predictions = model(X_test_tensor, time_test_tensor)
            test_loss = criterion(test_predictions, y_test_tensor)

    # scheduler.step(test_loss)

    if test_loss < best_loss:
        best_loss = test_loss
        best_epoch = epoch
        counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimiser_state_dict': optimiser.state_dict(),
            'loss': epoch_loss,
            'wandb_run_id': wandb_run.id,
        }, checkpoint_path)
        print("Model improved and saved!")
    else:
        counter += 1
        print(f"No improvement for {counter} epochs.")
        if counter >= patience:
            print("Early stopping triggered.")
            early_stop = True
            break

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}, Loss: {epoch_loss:.4f}")

# Load the best checkpoint before the final save
best_checkpoint = torch.load(checkpoint_path, weights_only=False)
model.load_state_dict(best_checkpoint['model_state_dict'])
optimiser.load_state_dict(best_checkpoint['optimiser_state_dict'])

try:
    torch.save({
        'epoch': best_checkpoint['epoch'],
        'model_state_dict': best_checkpoint['model_state_dict'],
        'optimiser_state_dict': best_checkpoint['optimiser_state_dict'],
        'loss': best_checkpoint['loss'],
        'wandb_run_id': best_checkpoint['wandb_run_id'],
    }, save_path)
    print(f"Best model successfully saved to: {save_path}")
except Exception as e:
    print(f"Error while saving: {e}")