I am trying to fit a model in TensorFlow 2.2. I have written a custom training loop. However, the training crashes pretty soon as the GPU runs out of memory. The model works fine with the same parameters when using the model.fit() API but I want to use a custom training loop as it offers more flexibility for my needs.
What is the general way of debugging such memory issues?
I searched around, but the official TF docs mostly cover how to debug logical errors, and a lot of pages talk about debugging in Graph mode.
Any advice would be appreciated! Thanks in advance!
UPDATE 1: The code being used is here on Colab. The input data is from the UCF-Crime dataset; it has been preprocessed into JPEGs and stored as segments in a TFRecord. A sample TFRecord is here. Each TFRecord contains 500 segments, where each segment is 16 consecutive JPEG-encoded frames of a video, scaled down to 128x128 RGB images.
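As a rough sanity check on these numbers, the decoded size of the data can be estimated by hand. The sketch below uses only the shapes stated above (16 frames of 128x128 RGB, 500 segments per TFRecord). The float32 case is an assumption, since the dtype used after decoding is not shown here, and the figures cover input tensors only, not activations or gradients.

```python
# Back-of-the-envelope size of decoded segments, using the shapes from the question.
FRAMES_PER_SEGMENT = 16
HEIGHT, WIDTH, CHANNELS = 128, 128, 3
SEGMENTS_PER_TFRECORD = 500


def segment_bytes(bytes_per_value: int) -> int:
    """Bytes needed to hold one decoded segment at the given element size."""
    return FRAMES_PER_SEGMENT * HEIGHT * WIDTH * CHANNELS * bytes_per_value


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes."""
    return n_bytes / 2**20


uint8_segment = segment_bytes(1)    # raw decoded JPEG pixels
float32_segment = segment_bytes(4)  # assumption: pixels cast to float32
float32_record = float32_segment * SEGMENTS_PER_TFRECORD

print(f"one segment, uint8:   {to_mib(uint8_segment):.2f} MiB")
print(f"one segment, float32: {to_mib(float32_segment):.2f} MiB")
print(f"one TFRecord fully decoded, float32: {to_mib(float32_record):.0f} MiB")
```

If the whole TFRecord ends up decoded and held in memory at once, that alone would account for roughly 1.5 GiB per file under the float32 assumption, before the model allocates anything.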
To reproduce the issue, you can build a dataset by passing the file path of the TFRecord to mod_build_dataset().
The code will immediately crash on CPU in Colab (as the maximum RAM available is ~13 GB), and the same happens on GPU.
On Kaggle it works fine on CPU (RAM usage maxes out at 14.2 GB) but crashes on GPU (it uses all 16 GB of graphics memory).
I am guessing this is because the CPU does the computations more slowly, so the GC gets time to kick in before it runs out of memory, whereas on the GPU it doesn't.
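One way to test a guess like this is to record memory after every step and see whether it keeps growing. The sketch below is a toy stand-in, not the actual training loop: `run_steps`, `keep_history` and the ~1 MB `bytearray` "batch" are hypothetical names chosen for illustration. It uses the standard-library `tracemalloc`, which only sees allocations made through Python's allocator, so it will not show TensorFlow's native or GPU memory. It can still reveal the common case where a custom loop keeps references alive, for example by appending per-step objects to a list.

```python
import tracemalloc


def run_steps(num_steps: int, keep_history: bool) -> list:
    """Toy loop: returns the traced Python memory (bytes) after each step."""
    history = []   # hypothetical per-step log kept by the loop
    readings = []
    tracemalloc.start()
    try:
        for _ in range(num_steps):
            batch = bytearray(1_000_000)  # stand-in for per-step data (~1 MB)
            if keep_history:
                history.append(batch)     # retained reference, cannot be freed
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


leaky = run_steps(5, keep_history=True)
steady = run_steps(5, keep_history=False)

print("growth with retained references:", leaky[-1] - leaky[0], "bytes")
print("growth without retained references:", steady[-1] - steady[0], "bytes")
```

If the per-step reading climbs steadily, something is holding on to data between steps, and GC timing alone is unlikely to be the explanation. If it stays flat on the Python side, the growth is more likely in native or device memory, where a profiler is the better tool.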
TensorFlow Profiler should help you.
Profiling helps you understand the hardware resource consumption (time and memory) of the various TensorFlow operations (ops) in your model, resolve performance bottlenecks and, ultimately, make the model execute faster.
The TensorFlow Profiler makes pinpointing the bottleneck of the training process much easier, so you can decide where to focus the optimization effort.
It also gives you recommendations on potential next steps you can follow to optimize your model performance.
The steps to use TensorFlow Profiler with custom training loops are shown below:
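# Internal module path, as used with TF 2.2; it may change between releases
from tensorflow.python.profiler import profiler_v2 as profiler
# Set up the profiling session ahead of time so the first measurements are more accurate
profiler.warmup()
# Start collecting; results are written under 'logdir' for viewing in TensorBoard
profiler.start(logdir='logdir')
# Train the model here
# Stop collecting and save the profile
profiler.stop()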
Please refer to this article on how to use TensorFlow Profiler for custom training.
For more information about TensorFlow Profiler, please refer to this Tutorial, this Guide and this documentation.