Reputation: 623
I'm running the following code to fine-tune a BERT Base Cased model in Google Colab. Sometimes the code runs fine on the first try without error. Other times, the same code with the same data results in a "CUDA out of memory" error. Previously, restarting the runtime (or exiting the notebook, re-opening it, doing a factory runtime reset) and re-running the code would succeed. Just now, though, I've restarted and re-tried 5 times and got the error every time.
The issue doesn't appear to be the combination of data and code that I'm using because sometimes it works without error. So it appears to be something to do with the Google Colab runtime.
Does anyone know why this is happening, why it is intermittent, and/or what I can do about it?
I'm using Hugging Face's transformers library and PyTorch.
The code cell that results in an error:
%%time
# train the model
history = defaultdict(list)

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        scheduler,
        train_set_length
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    dev_acc, dev_loss = eval_model(
        model,
        dev_data_loader,
        loss_fn,
        device,
        evaluation_set_length
    )

    print(f'Dev loss {dev_loss} accuracy {dev_acc}')

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['dev_acc'].append(dev_acc)
    history['dev_loss'].append(dev_loss)

    # save a checkpoint after every epoch
    model_filename = f'model_{epoch}_state.bin'
    torch.save(model.state_dict(), model_filename)
The full error:
RuntimeError Traceback (most recent call last)
<ipython-input-29-a13774d7aa75> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "\nhistory = defaultdict(list)\n\nfor epoch in range(EPOCHS):\n\n print(f'Epoch {epoch + 1}/{EPOCHS}')\n print('-' * 10)\n\n train_acc, train_loss = train_epoch(\n model,\n train_data_loader, \n loss_fn, \n optimizer, \n device, \n scheduler, \n train_set_length\n )\n\n print(f'Train loss {train_loss} accuracy {train_acc}')\n\n dev_acc, dev_loss = eval_model(\n model,\n dev_data_loader,\n loss_fn, \n device, \n evaluation_set_length\n )\n\n print(f'Dev loss {dev_loss} accuracy {dev_acc}')\n\n history['train_acc'].append(train_acc)\n history['train_loss'].append(train_loss)\n history['dev_acc'].append(dev_acc)\n history['dev_loss'].append(dev_loss)\n \n model_filename = f'model_{epoch}_state.bin'\n torch.save(model.state_dict(), model_filename)")
15 frames
<decorator-gen-60> in time(self, line, cell, local_ns)
<timed exec> in <module>()
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
234 # Take the dot product between "query" and "key" to get the raw attention scores.
235 attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
--> 236 attention_scores = attention_scores / math.sqrt(self.attention_head_size)
237 if attention_mask is not None:
238 # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 7.43 GiB total capacity; 5.42 GiB already allocated; 8.94 MiB free; 5.79 GiB reserved in total by PyTorch)
Upvotes: 5
Views: 3763
Reputation: 732
I was facing the same problem. Transformers are extremely memory intensive, so there is a high probability of running out of memory (or hitting the runtime limit) when training larger models or training for more epochs.
There are some well-known, out-of-the-box strategies for mitigating this, and each comes with its own benefits.
Training neural networks on a batch of sequences requires all sequences in the batch to have exactly the same length so they can be stacked into a single matrix. Because real-life NLP datasets always contain texts of variable length, we often need to make some sequences shorter by truncating them and others longer by appending a repeated special "pad" token at the end.
Because the pad token doesn't represent a real word, its signal is erased before the loss is computed: the "attention mask" for each sample marks the [PAD] positions with 0 (effectively multiplying their contribution by 0) and tells the Transformer to ignore them.
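As a minimal sketch of what this looks like in practice (assuming a recent transformers version where the tokenizer is callable, and the bert-base-cased checkpoint), padding appends [PAD] ids to the shorter sequence and the attention mask marks real tokens with 1 and padding with 0:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Two sentences of different lengths: the shorter one gets [PAD] ids appended,
# and the attention mask records which positions are real tokens (1) vs padding (0).
batch = tokenizer(
    ['A short sentence.', 'A noticeably longer sentence with several more words in it.'],
    padding=True,
    return_tensors='pt',
)
print(batch['input_ids'])
print(batch['attention_mask'])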
Dynamic Padding: here we limit the number of added pad tokens so that each mini batch is padded only to the length of its longest sequence, instead of to a fixed value set for the whole training set. Because the number of added tokens changes across mini batches, we call it "dynamic" padding.
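One way to wire this up is with a padding collate function in PyTorch's DataLoader; below is a sketch assuming a recent transformers release (which provides DataCollatorWithPadding) and a train_dataset that yields un-padded encodings, both of which are placeholders for whatever you already have:

from torch.utils.data import DataLoader
from transformers import BertTokenizer, DataCollatorWithPadding

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Pads each mini batch only up to the longest sequence in that batch,
# rather than to a fixed maximum length for the whole training set.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_data_loader = DataLoader(
    train_dataset,            # assumption: a dataset yielding un-padded encodings
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)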
Uniform Length Batching:
We push the logic further by building batches out of sequences of similar length, so we avoid the extreme case where most sequences in a mini batch are short but we still have to add lots of pad tokens to each of them because one sequence in the same mini batch is very long.
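A rough sketch of that idea: sort the examples by token length, slice them into batches of similar length, shuffle the batches, and hand them to the DataLoader as a batch_sampler. Here uniform_length_batches is a hypothetical helper, and train_texts, train_dataset and data_collator are assumed to exist already:

import random
from torch.utils.data import DataLoader

def uniform_length_batches(lengths, batch_size):
    # Sort example indices by length, cut them into batches of similar
    # length, then shuffle the batches themselves to keep some randomness.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)
    return batches

lengths = [len(tokenizer.encode(text)) for text in train_texts]

train_data_loader = DataLoader(
    train_dataset,
    batch_sampler=uniform_length_batches(lengths, batch_size=16),
    collate_fn=data_collator,   # still pad dynamically within each batch
)

Grouping by length this way trades a little shuffling randomness for far fewer wasted pad tokens, which directly reduces the peak GPU memory a batch needs.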
Upvotes: 1