I'm tuning a neural network to predict growth within a dataset, but I'm running into two issues in my experiments: 1) running the same experiment multiple times yields loss values that sometimes vary by as much as 0.5, and 2) the test loss shows an initial increase before it starts to decrease, and this happens consistently across runs.
I have been trying to make my experiments reproducible, but the results still vary by a substantial margin (test loss can land anywhere between 0.8 and 1.7 for the same experiment). I have tried to reduce randomness in several ways, including seeding, passing a generator to the dataset shuffling, and turning off AMP, but none of these have helped so far. The datasets are always the same. The only other thought I had is that it could be something to do with the random weight initialisation.
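For reference, this is roughly the kind of seeding I mean; the helper name, the seed value, and the commented-out DataLoader line are illustrative placeholders rather than my exact code:

import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    # Seed the Python, NumPy and PyTorch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable auto-tuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)

# The DataLoader shuffling is driven by an explicitly seeded generator
g = torch.Generator()
g.manual_seed(42)
# train_loader = DataLoader(train_dataset, batch_size=..., shuffle=True, generator=g)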
This looks like an unusual test loss curve to me: as I understand it, the test loss should drop and produce a curve similar to the training loss curve. As you can see, sometimes the test loss rises so high that early stopping terminates the run before the loss is able to improve at all. The plots aren't always indicative of the final best saved checkpoint; the best model is saved from every run.
I would love some thoughts on why the test loss could have such a pattern. It doesn't seem like straightforward overfitting, because the test loss does come back down, and most of the time it even drops further than before.
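One workaround I have been considering for the early-stopping side of this is to only let the patience counter run after a warm-up period, so the early bump in test loss can't kill the run before it has a chance to recover. min_epochs below is a made-up name, not something already in my script; the other variables are from the training loop further down:

min_epochs = 30  # hypothetical warm-up before early stopping may trigger

if test_loss < best_loss:
    best_loss = test_loss
    counter = 0
elif epoch >= min_epochs:  # only count non-improving epochs after the warm-up
    counter += 1
    if counter >= patience:
        early_stop = True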
*AMP and the LR scheduler (ReduceLROnPlateau) are currently disabled
import torch
import torch.nn as nn
import torch.optim as optim
from torch import amp
from torch.optim.lr_scheduler import ReduceLROnPlateau
import wandb

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = bruteForceNN().to(device)
criterion = nn.MSELoss()
optimiser = optim.Adam(model.parameters(), lr=0.0001)

start_epoch = 0
checkpoint = {}  # Initialise checkpoint as an empty dictionary
# Training loop
epochs = 200
l1_lambda = 1e-4 # Regularisation strength
# Initialise early stopping parameters
patience = 20
best_loss = float('inf')
best_epoch = 0
counter = 0
early_stop = False
print("Starting Training Loop")
scheduler = ReduceLROnPlateau(optimiser, mode='min', factor=0.5, patience=10)
# Check if CUDA is available
use_cuda = torch.cuda.is_available()
for epoch in range(start_epoch, epochs):
    if early_stop:
        print(f"Early stopped at epoch {epoch}")
        break

    model.train()
    epoch_loss = 0
    print(f"============ Epoch {epoch} ==========")

    for batch_X, batch_time, batch_y in train_loader:
        batch_X = batch_X.to(device)
        batch_time = batch_time.to(device)
        batch_y = batch_y.to(device)

        optimiser.zero_grad()  # Gradients from previous batch are zeroed

        # Forward pass (autocast kept in place but currently disabled, i.e. full precision)
        with amp.autocast(device_type='cuda' if use_cuda else 'cpu', enabled=False):
            predictions = model(batch_X, batch_time)  # Predictions are generated

            # Data loss
            data_loss = criterion(predictions, batch_y)

            # L1 regularisation loss
            l1_loss = sum(torch.sum(torch.abs(param)) for param in model.parameters())
            loss = data_loss + l1_lambda * l1_loss

        # Backpropagation
        loss.backward(retain_graph=True)
        optimiser.step()  # Model parameters are updated

        epoch_loss += loss.item()

    epoch_loss /= len(train_loader)
    current_lr = optimiser.param_groups[0]['lr']
    wandb.log({"epoch": epoch + 1, "learning_rate": current_lr, "train_loss": epoch_loss})

    # Evaluate model
    model.eval()
    with torch.no_grad():
        X_test_tensor = X_test_tensor.to(device)
        y_test_tensor = y_test_tensor.to(device)
        time_test_tensor = time_test_tensor.to(device)

        with amp.autocast(device_type='cuda' if use_cuda else 'cpu', enabled=False):
            test_predictions = model(X_test_tensor, time_test_tensor)
            test_loss = criterion(test_predictions, y_test_tensor)

    # scheduler.step(test_loss)

    if test_loss < best_loss:
        best_loss = test_loss
        best_epoch = epoch
        counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimiser_state_dict': optimiser.state_dict(),
            'loss': epoch_loss,
            'wandb_run_id': wandb_run.id,
        }, checkpoint_path)
        print("Model improved and saved!")
    else:
        counter += 1
        print(f"No improvement for {counter} epochs.")
        if counter >= patience:
            print("Early stopping triggered.")
            early_stop = True
            break

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch + 1}, Loss: {epoch_loss:.4f}")

# Load the best checkpoint before the final save
best_checkpoint = torch.load(checkpoint_path, weights_only=False)
model.load_state_dict(best_checkpoint['model_state_dict'])
optimiser.load_state_dict(best_checkpoint['optimiser_state_dict'])

try:
    torch.save({
        'epoch': best_checkpoint['epoch'],
        'model_state_dict': best_checkpoint['model_state_dict'],
        'optimiser_state_dict': best_checkpoint['optimiser_state_dict'],
        'loss': best_checkpoint['loss'],
        'wandb_run_id': best_checkpoint['wandb_run_id'],
    }, save_path)
    print(f"Best model successfully saved to: {save_path}")
except Exception as e:
    print(f"Error while saving: {e}")