How to debug Out of Memory issues with eager execution on TensorFlow 2?

Question

I am trying to fit a model in TensorFlow 2.2 using a custom training loop. However, training crashes soon after starting because the GPU runs out of memory. The model works fine with the same parameters when using the model.fit() API, but I want to use a custom training loop as it offers more flexibility for my needs.

What is the general way of debugging such memory issues?

I searched around, but the official TF docs mostly cover how to debug logical errors, and a lot of pages only discuss debugging in Graph mode.

Any advice would be appreciated! Thanks in advance!

UPDATE 1: The code being used is here on Colab. The input data is from the UCF-Crime dataset; it has been preprocessed into JPEGs and stored as segments in a TFRecord. A sample TFRecord is here. Each TFRecord contains 500 segments, where each segment is 16 consecutive JPEG-encoded frames of a video, scaled down to 128x128 RGB images.
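For a sense of scale, here is a rough back-of-the-envelope calculation of how much memory the decoded data would take. The segment and frame sizes come from the description above; the float32 element size is my assumption about what happens after decoding, and the function names are just for illustration.

```python
# Rough size estimate for the decoded data described above.
FRAMES_PER_SEGMENT = 16
HEIGHT, WIDTH, CHANNELS = 128, 128, 3
SEGMENTS_PER_RECORD = 500
BYTES_PER_FLOAT32 = 4  # assumption: frames are cast to float32 after decoding


def segment_bytes(bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed to hold one fully decoded segment."""
    return FRAMES_PER_SEGMENT * HEIGHT * WIDTH * CHANNELS * bytes_per_value


def record_bytes(bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed to hold every decoded segment of one TFRecord."""
    return SEGMENTS_PER_RECORD * segment_bytes(bytes_per_value)


if __name__ == "__main__":
    print(f"one segment:  {segment_bytes() / 2**20:.2f} MiB")
    print(f"one TFRecord: {record_bytes() / 2**30:.2f} GiB")
```

Under that assumption a single segment is 3 MiB and a whole TFRecord is about 1.46 GiB once decoded, so the input tensors alone should not account for 13 to 16 GB unless many copies are being kept alive.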

To reproduce the issue, you can build a dataset by passing the file path of the TFRecord to mod_build_dataset().

On Colab, the code crashes immediately on CPU (the maximum RAM available is ~13 GB), and the same happens on GPU.

On Kaggle, it works fine on CPU (RAM usage maxes out at 14.2 GB) but crashes on GPU (it uses all 16 GB of graphics memory).

I am guessing this is because the CPU does the computations more slowly, so the GC gets time to kick in before memory runs out, whereas on the GPU it doesn't.
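One way I could think of to test the GC guess is to record memory after every step, with and without forcing a collection. This is only a minimal sketch using the standard library: tracemalloc sees Python heap allocations on the host only, not TensorFlow's native or GPU allocations, and measure_growth, leaky_step and clean_step are hypothetical stand-ins rather than my actual training code.

```python
import gc
import tracemalloc


def measure_growth(step_fn, num_steps, force_gc=False):
    """Return the traced Python heap size (bytes) after each call to step_fn.

    Note: this only covers host-side Python allocations, not GPU memory.
    """
    tracemalloc.start()
    usage = []
    try:
        for step in range(num_steps):
            step_fn(step)
            if force_gc:
                gc.collect()  # rule out "the GC just has not run yet"
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


# Hypothetical stand-in for a training step that keeps references alive,
# e.g. by appending per-step results to a Python list.
history = []


def leaky_step(step):
    history.append(bytearray(1_000_000))


# Hypothetical stand-in for a step that releases everything when it returns.
def clean_step(step):
    _scratch = bytearray(1_000_000)


if __name__ == "__main__":
    for name, fn in [("leaky", leaky_step), ("clean", clean_step)]:
        usage = measure_growth(fn, num_steps=5, force_gc=True)
        print(name, [f"{u / 1e6:.1f} MB" for u in usage])
```

If usage keeps climbing even with force_gc=True, something is still holding references (a list of losses or outputs, for example), so slow garbage collection would not be the explanation. If it stays flat, the growth is happening somewhere this sketch cannot see.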
