Reputation: 4547
I'm trying to fine-tune a VGG16 model using Colaboratory, but I ran into this error when training with the GPU:
OOM when allocating tensor of shape [7,7,512,4096]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor of shape [7,7,512,4096] and type float
[[Node: vgg_16/fc6/weights/Momentum/Initializer/zeros = Const[_class=["loc:@vgg_16/fc6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'vgg_16/fc6/weights/Momentum/Initializer/zeros', defined at:
I also have this output for my VM session:
--- colab vm info ---
python v=3.6.3
tensorflow v=1.4.1
tf device=/device:GPU:0
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
MemTotal: 13341960 kB
MemFree: 1541740 kB
MemAvailable: 10035212 kB
My TFRecord is just 118 256x256 JPGs with a file size of <2 MB.
Is there a workaround? It works when I use the CPU, just not the GPU.
Upvotes: 7
Views: 13883
Reputation: 38704
Seeing a small amount of free GPU memory almost always indicates that you've created a TensorFlow session without the allow_growth = True
option. See:
https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth
If you don't set this option, by default, TensorFlow will reserve nearly all GPU memory when a session is created.
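As a minimal sketch (assuming TF 1.x and that you construct the session yourself), enabling the option looks like this:
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand rather than reserving it all up front
sess = tf.Session(config=config)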
Good news: as of this week, Colab now sets this option by default, so you should see much lower GPU memory growth as you use multiple notebooks on Colab. You can also inspect GPU memory usage per notebook by selecting 'Manage Sessions' from the Runtime menu.
Once selected, you'll see a dialog that lists all notebooks and the GPU memory each is consuming. To free memory, you can terminate runtimes from this dialog as well.
Upvotes: 4
Reputation: 11
I ran into the same issue and found that my problem was caused by the code below:
from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf() == True:
    device = '/gpu:0'
else:
    device = '/cpu:0'
I used the code below to check the GPU memory usage: it was 0% before running the code above, and it became 95% after.
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()
Before:
Gen RAM Free: 12.7 GB | Proc size: 139.1 MB
GPU RAM Free: 11438MB | Used: 1MB | Util 0% | Total 11439MB
After:
Gen RAM Free: 12.0 GB | Proc size: 1.0 GB
GPU RAM Free: 564MB | Used: 10875MB | Util 95% | Total 11439MB
Somehow, is_gpu_available() managed to consume most of the GPU memory without releasing it afterwards, so instead I used the code below to detect the GPU for me, and the problem was solved:
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
try:
    import GPUtil as GPU
    GPUs = GPU.getGPUs()
    device = '/gpu:0'
except:
    device = '/cpu:0'
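For context, a minimal sketch of how that device string might then be consumed (the ops here are just illustrative, not from the original code):
import tensorflow as tf

with tf.device(device):
    a = tf.constant([1.0, 2.0, 3.0])
    b = a * 2.0  # this op is pinned to whichever device was detected above

with tf.Session() as sess:
    print(sess.run(b))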
Upvotes: 1
Reputation: 1535
In my case, the solution provided by Ami didn't work, even though it's excellent, probably because the Colaboratory VM couldn't furnish more resources.
I had the OOM error in the detection phase (not during model training). I solved it with a workaround, disabling the GPU for detection:
config = tf.ConfigProto(device_count = {'GPU': 0})
sess = tf.Session(config=config)
Upvotes: 0
Reputation: 2282
I failed to repro the originally-reported error, but if that is caused by running out of GPU memory (as opposed to main memory) this might help:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
and then pass session_config=config to e.g. slim.learning.train() (or whatever session constructor you end up using).
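For instance, a hedged sketch that reuses the config above (train_op and the log directory are placeholder names, not from the original post):
import tensorflow.contrib.slim as slim

slim.learning.train(
    train_op,                       # placeholder: your training op
    logdir='/tmp/vgg16_finetune',   # placeholder: your log/checkpoint directory
    session_config=config)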
Upvotes: 0