Reputation: 41
I am trying to run TensorFlow on Windows 10 with the following setup:
Anaconda3 with
python 3.8
tensorflow 2.2.0
GPU: RTX3090
cuda_10.1.243
cudnn-v7.6.5.32 for windows10-x64
Running the following code takes between 5 and 10 minutes to print the output.
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
I get the following output immediately, but then it hangs for a few minutes before proceeding.
2020-11-17 04:03:00.039069: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-11-17 04:03:00.042677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-11-17 04:03:00.045041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-11-17 04:03:00.045775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-11-17 04:03:00.049246: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-11-17 04:03:00.050633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-11-17 04:03:00.056731: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-11-17 04:03:00.056821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
Running the same code on Colab takes only a second.
Any suggestions? Thanks
Upvotes: 4
Views: 3767
Reputation: 1412
The reason is the one Mux gives.
Background:
See https://developer.nvidia.com/blog/cuda-pro-tip-understand-fat-binaries-jit-caching/ for full explanation.
The first stage compiles source device code to PTX virtual assembly, and the second stage compiles the PTX to binary code for the target architecture. The CUDA driver can execute the second stage compilation at run time, compiling the PTX virtual assembly “Just In Time” to run it.
So when an old software package is used with new hardware, i.e. when binary code for the target architecture was not precompiled into the package, the driver falls back to the PTX virtual assembly and triggers a runtime JIT compile for the new target architecture. That means the cuDNN and cuBLAS kernels, as well as TensorFlow's built-in kernels, are all JIT compiled at startup, which incurs the very long startup time in your case.
That is why Dan Pavlov suggests enabling JIT caching: you JIT compile only once, instead of JIT compiling on every startup.
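As a minimal sketch of that idea (assuming the standard CUDA JIT cache environment variables CUDA_CACHE_DISABLE and CUDA_CACHE_MAXSIZE, described in the NVIDIA blog post linked above), you can raise the cache limit before TensorFlow initializes CUDA:

import os

# Must be set before TensorFlow (and hence the CUDA driver) initializes.
os.environ["CUDA_CACHE_DISABLE"] = "0"            # 0 = caching enabled (the default)
os.environ["CUDA_CACHE_MAXSIZE"] = "2147483648"   # raise the cache limit to 2 GiB

import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

The first run still pays the full JIT cost, but subsequent runs reuse the cached binaries from disk.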
Upvotes: 1
Reputation: 159
I don't understand why Mux's answer is downvoted, as he is right. NVIDIA Ampere can't run optimally on CUDA versions < 11.1, as the Ampere streaming multiprocessors (SM_86) are only supported from CUDA 11.1 onwards, see https://forums.developer.nvidia.com/t/can-rtx-3080-support-cuda-10-1/155849/2
However, the direct solution to your issue without updating CUDA may be to increase the default JIT cache size with 'export CUDA_CACHE_MAXSIZE=2147483648' (2 GiB); on Windows, set the same environment variable via 'set' or the system environment settings. You will still have this long wait on the first startup though, see https://www.tensorflow.org/install/gpu#hardware_requirements
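If you want to confirm the architecture mismatch yourself, here is a sketch that prints the GPU's compute capability (note: get_device_details requires TensorFlow 2.4+, newer than the 2.2.0 in the question). An RTX 3090 reports (8, 6), i.e. SM_86:

import tensorflow as tf

# Requires TF 2.4+; prints e.g. (8, 6) for an RTX 3090.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, details.get("compute_capability"))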
Upvotes: 6
Reputation: 71
The RTX 3090 has the Ampere architecture, which requires CUDA 11+. Check out this guide: https://medium.com/@dun.chwong/the-simple-guide-deep-learning-with-rtx-3090-cuda-cudnn-tensorflow-keras-pytorch-e88a2a8249bc
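After upgrading, you can check which CUDA/cuDNN versions your installed TensorFlow wheel was actually built against (tf.sysconfig.get_build_info is available from TF 2.3 onwards; the exact dictionary keys may vary slightly between versions):

import tensorflow as tf

# Reports the CUDA/cuDNN versions the wheel was compiled with (TF 2.3+).
info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"), info.get("cudnn_version"))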
Upvotes: 5