Reputation: 21
I am testing a recently bought ASUS ROG STRIX 1080 ti (11 GB) card via a simple test python (matmul.py) program from https://learningtensorflow.com/lesson10/ . The virtual environment (venv) setup is as follows : ubuntu=16.04, tensorflow-gpu==1.5.0, python=3.6.6, CUDA==9.0, Cudnn==7.2.1.
CUDA_ERROR_OUT_OF_MEMORY occured.
And, the most strangest : totalMemory: 10.91GiB freeMemory: 61.44MiB ..
I am not sure whether if it was due to the environmental setup or due to the 1080 ti itself. I would appreciate if any excerpts could advise here.
The terminal showed -
(venv) xx@xxxxxx:~/xx$ python matmul.py gpu 1500
2018-10-01 09:05:12.459203: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-01 09:05:12.514203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-01 09:05:12.514445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.607
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 61.44MiB
2018-10-01 09:05:12.514471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-10-01 09:05:12.651207: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 11.44M (11993088 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
......
Upvotes: 1
Views: 688
Reputation: 21
After reboot, I was able to run sample codes of tersorflow.org - https://www.tensorflow.org/guide/using_gpu without memory issues.
Before running tensorflow samples codes for checking 1080 ti, I had a difficulty training Mask-RCNN models as posted - Mask RCNN Resource exhausted (OOM) on my own dataset After replacing cudnn 7.2.1 with 7.0.5, no more resource exhausted (OOM) issue occurred.
Upvotes: 1
Reputation: 91
I solved this problem by putting a cap on the memory usage:
def gpu_config():
config = tf.ConfigProto(
allow_soft_placement=True, log_device_placement=False)
config.gpu_options.allow_growth = True
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.8
print("GPU memory upper bound:", upper)
return config
Then you can do just do:
config = gpu_config()
with tf.Session(config=config) as sess:
....
Upvotes: 1
Reputation: 1299
It can happen that a Python process gets stuck on the GPU. Always check the processes with nvidia-smi
and kill them manually if necessary.
Upvotes: 1