aleio1

Reputation: 94

TensorFlow out of memory and CPU/GPU usage

I am using Tensorflow with Keras to train a neural network for object recognition (YOLO).

I wrote the model and I am trying to train it using Keras model.fit_generator() with batches of 32 images of size 416x416x3.
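
For reference, the training call is roughly of this shape (a simplified sketch with random data and a tiny placeholder model, not the actual YOLO code):

    import numpy as np
    from tensorflow import keras

    IMG_SHAPE = (416, 416, 3)
    BATCH_SIZE = 32

    def data_generator(num_samples, batch_size=BATCH_SIZE):
        # In the real code the images and labels are read from files;
        # random arrays stand in for them in this sketch.
        while True:
            for _ in range(num_samples // batch_size):
                x = np.random.rand(batch_size, *IMG_SHAPE).astype(np.float32)
                y = np.random.rand(batch_size, 10).astype(np.float32)
                yield x, y

    # Tiny placeholder network standing in for the YOLO model
    model = keras.Sequential([
        keras.layers.Conv2D(16, 3, strides=2, activation='relu', input_shape=IMG_SHAPE),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam', loss='mse')

    model.fit_generator(
        data_generator(num_samples=320),
        steps_per_epoch=320 // BATCH_SIZE,
        epochs=1,
    )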

I am using an NVIDIA GeForce RTX 2070 GPU with 8 GB of memory (TensorFlow uses about 6.6 GB of it).

However, when I start training the model I receive messages like this:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape

2019-02-11 16:13:08.051289: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 338.00MiB.  Current allocation summary follows.
2019-02-11 16:13:08.057318: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 1589, Chunks in use: 1589. 397.3KiB allocated for chunks. 397.3KiB in use in bin. 25.2KiB client-requested in use in bin.
2019-02-11 16:13:08.061222: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):   Total Chunks: 204, Chunks in use: 204. 102.0KiB allocated for chunks. 102.0KiB in use in bin. 100.1KiB client-requested in use in bin.
...
2019-02-11 16:13:08.142674: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456):     Total Chunks: 11, Chunks in use: 11. 5.05GiB allocated for chunks. 5.05GiB in use in bin. 4.95GiB client-requested in use in bin.
2019-02-11 16:13:08.148149: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 338.00MiB was 256.00MiB, Chunk State:
2019-02-11 16:13:08.150474: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400000 of size 1280
2019-02-11 16:13:08.152627: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400500 of size 256
2019-02-11 16:13:08.154790: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400600 of size 256
....
2019-02-11 16:17:38.699526: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 6.11GiB
2019-02-11 16:17:38.701621: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  6624727531
InUse:                  6557567488
MaxInUse:               6590199040
NumAllocs:                    3719
MaxAllocSize:           1624768512

2019-02-11 16:17:38.708981: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-02-11 16:17:38.712172: W tensorflow/core/framework/op_kernel.cc:1412] OP_REQUIRES failed at conv_ops_fused.cc:734 : Resource exhausted: OOM when allocating tensor with shape[16,256,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I reported only a few lines of that message, but it seems clear that it is a memory usage problem.

Should I maybe use the CPU in my generator function for reading images and labels from files? If so, how should I do it?

Thank you.

Upvotes: 0

Views: 3835

Answers (1)

Daniel Möller

Reputation: 86600

416x416 is quite a large input size for neural networks.

The solution in this case is to reduce the batch size.
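
For example (hypothetical numbers, and data_generator is a stand-in for your own generator):

    BATCH_SIZE = 8   # was 32; activation memory per step scales roughly with batch size

    model.fit_generator(
        data_generator(num_samples=320, batch_size=BATCH_SIZE),  # your generator, just with smaller batches
        steps_per_epoch=320 // BATCH_SIZE,
        epochs=1,
    )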

Other solutions that you might not like are:

  • reduce the model capacity (less units/filters in layers)
  • reduce the image size
  • try float32 if you're using float64 (this might be very hard in Keras depending on which layers you're using); see the sketch after this list
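
For the float32 point, a minimal sketch of checking and setting the Keras default float type (this must happen before any layers are built):

    from tensorflow import keras
    K = keras.backend

    print(K.floatx())        # default is usually 'float32'
    K.set_floatx('float32')  # force float32 globally, before the model is built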

Keras/TensorFlow has a strange behavior when allocating memory. I don't know how it works, but I've seen rather big models pass and smaller models fail. These smaller models, however, had more intricate operations and branches.

An important thing:

If this problem is happening in your first conv layer, there is nothing that can be done in the rest of the model; you need to reduce the filters of the first layer (or the image size).
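
To illustrate, this is the kind of change I mean (made-up filter counts, assuming a Keras Conv2D first layer):

    from tensorflow import keras

    model = keras.Sequential([
        # e.g. 16 filters instead of 32 in the very first layer,
        # and/or a smaller input size than 416x416
        keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                            input_shape=(416, 416, 3)),
        # ... rest of the network unchanged ...
    ])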

Upvotes: 4
