Reputation: 4547
I'm trying to fine-tune a VGG16 model using Colaboratory, but I ran into this error when training with the GPU:
OOM when allocating tensor of shape [7,7,512,4096]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor of shape [7,7,512,4096] and type float
[[Node: vgg_16/fc6/weights/Momentum/Initializer/zeros = Const[_class=["loc:@vgg_16/fc6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op 'vgg_16/fc6/weights/Momentum/Initializer/zeros', defined at:
I also have this output for my VM session:
--- colab vm info ---
python v=3.6.3
tensorflow v=1.4.1
tf device=/device:GPU:0
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
MemTotal: 13341960 kB
MemFree: 1541740 kB
MemAvailable: 10035212 kB
My TFRecord is just 118 256x256 JPGs with a file size of <2 MB.
Is there a workaround? It works when I use the CPU, just not the GPU.
Upvotes: 7
Views: 13883
Reputation: 38704
Seeing a small amount of free GPU memory almost always indicates that you've created a TensorFlow session without the allow_growth = True
option. See:
https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth
If you don't set this option, by default, TensorFlow will reserve nearly all GPU memory when a session is created.
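As a minimal sketch (assuming TF 1.x and that you construct the session yourself), enabling the option looks like this:
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand rather than reserving it all up front
sess = tf.Session(config=config)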
Good news: as of this week, Colab now sets this option by default, so you should see much lower GPU memory growth as you use multiple notebooks on Colab. You can also inspect GPU memory usage per notebook by selecting 'Manage Sessions' from the Runtime menu.
Once selected, you'll see a dialog that lists all notebooks and the GPU memory each is consuming. To free memory, you can terminate runtimes from this dialog as well.
Upvotes: 4
Reputation: 11
I ran into the same issue and found that my problem was caused by the code below:
from tensorflow.python.framework.test_util import is_gpu_available as tf
if tf() == True:
    device = '/gpu:0'
else:
    device = '/cpu:0'
I used the code below to check the GPU memory usage: it was 0% before running the code above, and it became 95% after.
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn't guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print('GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB'.format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()
Before:
Gen RAM Free: 12.7 GB | Proc size: 139.1 MB
GPU RAM Free: 11438MB | Used: 1MB | Util 0% | Total 11439MB
After:
Gen RAM Free: 12.0 GB | Proc size: 1.0 GB
GPU RAM Free: 564MB | Used: 10875MB | Util 95% | Total 11439MB
Somehow, is_gpu_available() managed to consume most of the GPU memory without releasing it afterwards, so instead I used the code below to detect the GPU for me, and the problem was solved:
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
try:
    import GPUtil as GPU
    GPUs = GPU.getGPUs()
    device = '/gpu:0'
except:
    device = '/cpu:0'
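For context, a minimal sketch of how that device string might then be consumed (the ops here are just illustrative, not from the original code):
import tensorflow as tf

with tf.device(device):
    a = tf.constant([1.0, 2.0, 3.0])
    b = a * 2.0  # this op is pinned to whichever device was detected above

with tf.Session() as sess:
    print(sess.run(b))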
Upvotes: 1
Reputation: 1535
In my case, the solution provided by Ami didn't work, even though it's excellent, probably because the Colaboratory VM couldn't furnish more resources.
I had the OOM error in the detection phase (not during model training). I solved it with a workaround, disabling the GPU for detection:
config = tf.ConfigProto(device_count = {'GPU': 0})
sess = tf.Session(config=config)
Upvotes: 0
Reputation: 2282
I failed to repro the originally-reported error, but if that is caused by running out of GPU memory (as opposed to main memory) this might help:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
and then pass session_config=config to e.g. slim.learning.train() (or whatever session constructor you end up using).
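For instance, a hedged sketch that reuses the config above (train_op and the log directory are placeholder names, not from the original post):
import tensorflow.contrib.slim as slim

slim.learning.train(
    train_op,                       # placeholder: your training op
    logdir='/tmp/vgg16_finetune',   # placeholder: your log/checkpoint directory
    session_config=config)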
Upvotes: 0