Reputation: 623
I'm running the following code to fine-tune a BERT Base Cased model in Google Colab. Sometimes the code runs fine on the first try without error. Other times, the same code with the same data results in a "CUDA out of memory" error. Previously, restarting the runtime (or exiting the notebook, re-opening it, doing a factory runtime reset) and re-running the code would succeed. Just now, though, I've restarted and re-tried 5 times and got the error every time.
The issue doesn't appear to be the combination of data and code that I'm using because sometimes it works without error. So it appears to be something to do with the Google Colab runtime.
Does anyone know why this is happening, why it is intermittent, and/or what I can do about it?
I'm using Hugging Face's transformers library and PyTorch.
The code cell that results in an error:
%%time
# train the model
history = defaultdict(list)

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(
        model,
        train_data_loader,
        loss_fn,
        optimizer,
        device,
        scheduler,
        train_set_length
    )

    print(f'Train loss {train_loss} accuracy {train_acc}')

    dev_acc, dev_loss = eval_model(
        model,
        dev_data_loader,
        loss_fn,
        device,
        evaluation_set_length
    )

    print(f'Dev loss {dev_loss} accuracy {dev_acc}')

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['dev_acc'].append(dev_acc)
    history['dev_loss'].append(dev_loss)

    # save a checkpoint after every epoch
    model_filename = f'model_{epoch}_state.bin'
    torch.save(model.state_dict(), model_filename)
The full error:
RuntimeError Traceback (most recent call last)
<ipython-input-29-a13774d7aa75> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "\nhistory = defaultdict(list)\n\nfor epoch in range(EPOCHS):\n\n print(f'Epoch {epoch + 1}/{EPOCHS}')\n print('-' * 10)\n\n train_acc, train_loss = train_epoch(\n model,\n train_data_loader, \n loss_fn, \n optimizer, \n device, \n scheduler, \n train_set_length\n )\n\n print(f'Train loss {train_loss} accuracy {train_acc}')\n\n dev_acc, dev_loss = eval_model(\n model,\n dev_data_loader,\n loss_fn, \n device, \n evaluation_set_length\n )\n\n print(f'Dev loss {dev_loss} accuracy {dev_acc}')\n\n history['train_acc'].append(train_acc)\n history['train_loss'].append(train_loss)\n history['dev_acc'].append(dev_acc)\n history['dev_loss'].append(dev_loss)\n \n model_filename = f'model_{epoch}_state.bin'\n torch.save(model.state_dict(), model_filename)")
15 frames
<decorator-gen-60> in time(self, line, cell, local_ns)
<timed exec> in <module>()
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask)
234 # Take the dot product between "query" and "key" to get the raw attention scores.
235 attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
--> 236 attention_scores = attention_scores / math.sqrt(self.attention_head_size)
237 if attention_mask is not None:
238 # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 7.43 GiB total capacity; 5.42 GiB already allocated; 8.94 MiB free; 5.79 GiB reserved in total by PyTorch)
Upvotes: 5
Views: 3763
Reputation: 732
I was facing the same problem. Transformers are extremely memory intensive, so there is a high probability of running out of memory (or hitting the runtime limit) when training larger models or training for more epochs.
There are some well-known, out-of-the-box strategies for mitigating this, and each comes with its own benefits.
Training neural networks on a batch of sequences requires all sequences in the batch to have exactly the same length so they can be stacked into a single matrix. Because real-life NLP datasets always contain texts of variable length, we often need to make some sequences shorter by truncating them and others longer by appending a repeated special "pad" token at the end.
Because the pad token doesn't represent a real word, its signal is erased before the loss is computed: the "attention mask" for each sample marks the [PAD] positions with 0 (effectively multiplying their contribution by 0) and tells the Transformer to ignore them.
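As a minimal sketch of what this looks like in practice (assuming a recent transformers version where the tokenizer is callable, and the bert-base-cased checkpoint), padding appends [PAD] ids to the shorter sequence and the attention mask marks real tokens with 1 and padding with 0:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Two sentences of different lengths: the shorter one gets [PAD] ids appended,
# and the attention mask records which positions are real tokens (1) vs padding (0).
batch = tokenizer(
    ['A short sentence.', 'A noticeably longer sentence with several more words in it.'],
    padding=True,
    return_tensors='pt',
)
print(batch['input_ids'])
print(batch['attention_mask'])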
Dynamic Padding: here we limit the number of added pad tokens so that each mini batch is padded only to the length of its longest sequence, instead of to a fixed value set for the whole training set. Because the number of added tokens changes across mini batches, we call it "dynamic" padding.
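One way to wire this up is with a padding collate function in PyTorch's DataLoader; below is a sketch assuming a recent transformers release (which provides DataCollatorWithPadding) and a train_dataset that yields un-padded encodings, both of which are placeholders for whatever you already have:

from torch.utils.data import DataLoader
from transformers import BertTokenizer, DataCollatorWithPadding

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Pads each mini batch only up to the longest sequence in that batch,
# rather than to a fixed maximum length for the whole training set.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_data_loader = DataLoader(
    train_dataset,            # assumption: a dataset yielding un-padded encodings
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator,
)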
Uniform Length Batching:
We push the logic further by building batches out of sequences of similar length, so we avoid the extreme case where most sequences in a mini batch are short but we still have to add lots of pad tokens to each of them because one sequence in the same mini batch is very long.
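A rough sketch of that idea: sort the examples by token length, slice them into batches of similar length, shuffle the batches, and hand them to the DataLoader as a batch_sampler. Here uniform_length_batches is a hypothetical helper, and train_texts, train_dataset and data_collator are assumed to exist already:

import random
from torch.utils.data import DataLoader

def uniform_length_batches(lengths, batch_size):
    # Sort example indices by length, cut them into batches of similar
    # length, then shuffle the batches themselves to keep some randomness.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)
    return batches

lengths = [len(tokenizer.encode(text)) for text in train_texts]

train_data_loader = DataLoader(
    train_dataset,
    batch_sampler=uniform_length_batches(lengths, batch_size=16),
    collate_fn=data_collator,   # still pad dynamically within each batch
)

Grouping by length this way trades a little shuffling randomness for far fewer wasted pad tokens, which directly reduces the peak GPU memory a batch needs.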
Upvotes: 1