Jia Li

Reputation: 81

RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

Problem: when I run the following command

python -c "import tensorflow as tf; tf.test.is_gpu_available(); print('version :' + tf.__version__)"

Error:

RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

Details:

WARNING:tensorflow:From <string>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2021-04-18 21:02:51.839069: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-04-18 21:02:51.846775: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2500000000 Hz
2021-04-18 21:02:51.847076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fc3bc000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-18 21:02:51.847104: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-04-18 21:02:51.849876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-18 21:02:51.911161: W tensorflow/compiler/xla/service/platform_util.cc:210] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
2021-04-18 21:02:51.911285: I tensorflow/compiler/jit/xla_gpu_device.cc:161] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2021-04-18 21:02:51.911546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.912210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:00:07.0 name: GRID T4-4Q computeCapability: 7.5 coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 3.97GiB deviceMemoryBandwidth: 298.08GiB/s
2021-04-18 21:02:51.912446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-04-18 21:02:51.914362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-04-18 21:02:51.916358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-04-18 21:02:51.916679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-04-18 21:02:51.918787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-04-18 21:02:51.919993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-04-18 21:02:51.924652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-18 21:02:51.924792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.925488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-18 21:02:51.926100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-04-18 21:02:51.926146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1496, in is_gpu_available
    for local_device in device_lib.list_local_devices():
  File "/home/miniconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/client/device_lib.py", line 43, in list_local_devices
    _convert(s) for s in _pywrap_device_lib.list_devices(serialized_config)
RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
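As the deprecation warning suggests, the same check can be written with the non-deprecated API. A minimal sketch of that check (TF 2.x); when the underlying CUDA initialization fails, this call typically logs the same errors but returns an empty list instead of raising:

import tensorflow as tf

# Non-deprecated GPU check suggested by the deprecation warning above (TF 2.x).
gpus = tf.config.list_physical_devices('GPU')
print('version :' + tf.__version__)
print('GPUs found:', gpus)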

System information:

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: cloud server
TensorFlow installed from (source or binary): source
TensorFlow version: 2.2.0
Python version: 3.7.7
Installed using virtualenv? pip? conda?: pip & conda.
Bazel version (if compiling from source): 2.0.0
GCC/Compiler version (if compiling from source): 7.5
CUDA/cuDNN version: CUDA 10.1 & cuDNN 7.6.5
GPU model and memory:
00:07.0 VGA compatible controller: NVIDIA Corporation Device 1eb8 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 130e
Physical Slot: 7
Flags: bus master, fast devsel, latency 0, IRQ 37
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
I/O ports at c500 [size=128]
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
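To rule TensorFlow out, here is a minimal sketch that calls the CUDA driver API directly via ctypes; the cuDevicePrimaryCtxRetain / CUDA_ERROR_UNKNOWN failure in the log above happens at this layer, and cuInit and cuGetErrorString are standard driver API entry points exposed by libcuda.so.1:

import ctypes

# Load the CUDA driver library and try to initialize it, bypassing TensorFlow.
cuda = ctypes.CDLL('libcuda.so.1')
result = cuda.cuInit(0)

# Translate the numeric CUresult into a readable string (0 means CUDA_SUCCESS).
err_str = ctypes.c_char_p()
cuda.cuGetErrorString(result, ctypes.byref(err_str))
print('cuInit returned', result, err_str.value)

If cuInit already fails here, the problem is in the driver/vGPU setup rather than in TensorFlow or the CUDA toolkit version.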



I tried looking for solutions to this problem but none of them solved it:

https://forums.developer.nvidia.com/t/all-cuda-capable-devices-are-busy-or-unavailable-what-is-wrong/112858

https://github.com/tensorflow/tensorflow/issues/41990

Tensorflow-GPU Error: "RuntimeError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable"

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#recommended-post

https://github.com/tensorflow/tensorflow/issues/48558

https://programmersought.com/article/94034772029/

Upvotes: 1

Views: 12248

Answers (2)

DholuBholu

Reputation: 314

You can try rebooting the system. Your GPU may still be occupied by a previous run and has not been freed since.
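One way to confirm that before resorting to a reboot (a sketch, assuming the optional pynvml package from nvidia-ml-py is installed) is to list the compute processes currently holding GPU 0:

import pynvml

# Query the driver for processes that currently hold a compute context on GPU 0.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print('PID', proc.pid, 'holds', proc.usedGpuMemory, 'bytes of GPU memory')
pynvml.nvmlShutdown()

If a stale PID shows up, killing that process may free the device without a full reboot.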

Upvotes: 1

Fabiano Tarlao

Reputation: 3232

I can confirm the case mentioned in a comment.

I had the problem while working with an Ubuntu VM running on a VMware ESXi host, using a vGPU partition of an NVIDIA V100 GPU.

I got the same error, and I had already tried changing CUDA versions and installing (via pip) packages compiled for those specific CUDA versions; this did NOT solve the issue. The error:

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

In my case I had forgotten to set the license server in /etc/nvidia/grid.conf and got exactly the same error, so it was a GRID licensing issue: fixing the grid config file and rebooting solved it.
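For anyone hitting the same thing, a rough way to check the licensing state from Python (a sketch; on GRID/vGPU setups nvidia-smi -q reports license information, though the exact field names vary by driver version):

import subprocess

# Dump the full nvidia-smi query output and keep only the licensing-related lines.
output = subprocess.run(['nvidia-smi', '-q'], capture_output=True, text=True).stdout
for line in output.splitlines():
    if 'license' in line.lower():
        print(line.strip())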

Upvotes: 2
