Why do I get CUDA_ERROR_OUT_OF_MEMORY: out of memory on Nvidia Quadro 8000, with more than enough available memory on Tensorflow-gpu 2.0

Question

We recently got a Quadro 8000 for training purposes at our lab. However, I am not able to run the simplest of codes, where cuda_driver.cc complains about failing to allocate memory (with subsequent messages indicating that cuda failed to allocate 38.17G, then 34.36G, 30.92G, 27.83G, 25.05G, 22.54G) even when GPU:0 is shown to be having 39090 MB memory. I am using miniconda based python with tensorflow-gpu 2.0.0 and compatible versions of cudnn(7.6.4) and cudatoolkit(10.0.130) pulled automatically using conda install. The simple code is as follows.

from __future__ import absolute_import, division, print_function, unicode_literals
import MClib
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
      print('tf Memory growth : %r' % (tf.config.experimental.get_memory_growth(gpus[0])))
      tf.config.experimental.set_virtual_device_configuration(
        gpus[0],[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512*38)])

    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print("%d Physical GPUs, %d Logical GPUs" % (len(gpus), len(logical_gpus)))
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
    print(c)
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Fixes that I have tried:

Fixes such as rebooting the pc(duh), decreasing the batch size, etc. still give a problem. When I train using a small dataset, sometimes the training continues despite of the memory errors, something that others have encountered but were able to solve (if I remember correctly) by fiddling with the memory_growth options. This solution did not help me.

I do have a temporary fix, by setting a memory limit by uncommenting the first two lines after the try statement in the code above. I discovered though, that I cannot force the gpu to allocate more than approximately 20G, eventhough the gpu has around double the memory available. Googling around points to setting either the memory growth or memory limit on the GPU. I have even tried to set both (uncommenting the 3rd and 4th lines as well), but to no avail. Has anyone come across a similar issue? Or is there a limit to the GPU memory that one can use?

System: Dell  Precision 5820
Processor: Intel Xeon W-2123 CPU @ 3.60 GHz, 4 cores, 8 processors
RAM: 16G

Why do I get CUDA_ERROR_OUT_OF_MEMORY: out of memory on Nvidia Quadro 8000, with more than enough available memory on Tensorflow-gpu 2.0

Fixes that I have tried:

Answers (1)

Related Questions