Ankan Sharma

Reputation: 61

My training and validation loss suddenly increased by a power of 3

Training

train function

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        optimizer.zero_grad()
        output = model(batch.text)
        loss = criterion(output, torch.unsqueeze(batch.labels, 1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

main_script

def main(
        train_file,
        test_file,
        config_file,
        checkpoint_path,
        best_model_path
    ):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    with open(config_file, 'r') as j:
        config = json.loads(j.read())

    for k,v in config['model'].items():
        v = float(v)
        if v < 1.0:
            config['model'][k] = float(v)
        else:
            config['model'][k] = int(v)

    for k,v in config['training'].items():
        v = float(v)
        if v < 1.0:
            config['training'][k] = float(v)
        else:
            config['training'][k] = int(v)

    train_itr, val_itr, test_itr, vocab_size = data_pipeline(
        train_file,
        test_file,
        config['training']['max_vocab'],
        config['training']['min_freq'],
        config['training']['batch_size'],
        device
    )

    model = CNNNLPModel(
        vocab_size,
        config['model']['emb_dim'],
        config['model']['hid_dim'],
        config['model']['model_layer'],
        config['model']['model_kernel_size'],
        config['model']['model_dropout'],
        device
    )
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    num_epochs = config['training']['n_epoch']
    clip = config['training']['clip']
    is_best = False
    best_valid_loss = float('inf')
    model = model.to(device)
    for epoch in tqdm(range(num_epochs)):

        train_loss = train(model, train_itr, optimizer, criterion, clip)
        valid_loss = evaluate(model, val_itr, criterion)

        if (epoch + 1) % 2 == 0:
            print("training loss {}, validation_loss{}".format(train_loss,valid_loss))

I was training a convolutional neural network for binary text classification: given a sentence, it detects whether it is hate speech or not. Training and validation loss were fine until epoch 5, after which both suddenly shot up from about 0.2 to 10,000.

What could be the reason for such a huge, sudden increase in loss?

Upvotes: 2

Views: 3078

Answers (1)

Szymon Maszke

Reputation: 24914

The default learning rate of Adam is 0.001, which, depending on the task, might be too high.

It looks like, instead of converging, your neural network diverged (it left the previous ~0.2 loss minimum and fell into a different region).

Lowering your learning rate at some point (after 50% or 70% of training) would probably fix the issue.
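
As a rough sketch of that idea (not code from the question): once a chosen fraction of the epochs has passed, the learning rate stored in the optimizer's param_groups can be scaled down in place. The 0.7 cutoff and the 0.5 factor below are illustrative assumptions, and the helper name is made up for this example.

def lower_lr_midway(optimizer, epoch, num_epochs, factor=0.5, cutoff=0.7):
    # Scale every param group's learning rate once `cutoff` of training has passed.
    if epoch == int(cutoff * num_epochs):
        for param_group in optimizer.param_groups:
            param_group['lr'] *= factor

Called once per epoch inside the training loop, e.g. lower_lr_midway(optimizer, epoch, num_epochs).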

Usually people divide the learning rate by 10 (0.0001 in your case) or halve it (0.0005 in your case). Try halving it first and see if the issue persists; in general you want to keep the learning rate as high as possible until divergence occurs, as is probably the case here.
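
For instance, the halved value can be passed explicitly when constructing the optimizer (a one-line change to the optim.Adam(model.parameters()) call in your script; 5e-4 is just the "divide by half" value suggested above):

optimizer = optim.Adam(model.parameters(), lr=5e-4)  # half of Adam's 0.001 default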

This is what learning rate schedulers are for (gamma specifies the learning rate multiplier; you might want to change it to 0.5 first).
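
A minimal sketch of the scheduler approach, assuming torch.optim.lr_scheduler.StepLR and reusing the names from your main() (the step_size of 3 epochs is an illustrative choice; gamma=0.5 is the halving suggested above):

from torch.optim.lr_scheduler import StepLR

optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)  # multiply lr by 0.5 every 3 epochs

for epoch in range(num_epochs):
    train_loss = train(model, train_itr, optimizer, criterion, clip)
    valid_loss = evaluate(model, val_itr, criterion)
    scheduler.step()  # decay the learning rate after each epoch

Other schedulers work similarly, e.g. ReduceLROnPlateau, which lowers the learning rate only when the validation loss stops improving.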

One can think of the lower learning rate phase as fine-tuning an already found solution (placing the weights in a better region of the loss valley), and it might require some patience.

Upvotes: 1
