Julian Moore

Reputation: 1028

CUDA issue - how to clean install CUDA in Win 10 to resolve cudaGetDevice() failed

I previously had CUDA 9.x running on this Win 10 64-bit Home system (targeting a 1080 Ti card), but I need to update to CUDA 10.0 for TensorFlow 2. I initially thought TF2 was OK with CUDA 10.1, so I installed 10.1 first and only later found out that it must be CUDA 10.0.

Can't get it to work...

To validate the TF installation, I ran this (Jupyter notebook via Anaconda, in a freshly built TF2 environment):

import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))

I get this error in the basic Python test:

InternalError: cudaGetDevice() failed. Status: cudaGetErrorString symbol not found

This suggests that a key file cannot be found, but I can't work out the root cause - and there are very few hits on that error info, none of which helped me.
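One way to narrow down a "symbol not found" error like this is to check which copies of the CUDA DLLs are actually visible on PATH. Below is a minimal Python sketch of that check. The helper `locate_dlls` is hypothetical, and the DLL names are an assumption based on the versions discussed here (CUDA 10.0 runtime, cuDNN 7.x), so adjust them for your setup. Windows also searches some system directories regardless of PATH, so treat the result as a hint rather than proof.

```python
import os

# DLL names TensorFlow 2.0 on Windows is generally expected to load.
# This list is an assumption; adjust it to your CUDA and cuDNN versions.
EXPECTED_DLLS = ["nvcuda.dll", "cudart64_100.dll", "cudnn64_7.dll"]


def locate_dlls(names, path_value, sep=os.pathsep):
    """Map each DLL name to every directory in path_value that contains it."""
    found = {name: [] for name in names}
    for directory in path_value.split(sep):
        if not directory:
            continue  # skip empty PATH entries
        for name in names:
            if os.path.isfile(os.path.join(directory, name)):
                found[name].append(directory)
    return found


if __name__ == "__main__":
    report = locate_dlls(EXPECTED_DLLS, os.environ.get("PATH", ""))
    for name, dirs in report.items():
        print(name, "->", dirs or "NOT FOUND on PATH")
```

A DLL that appears in more than one directory, or in a directory belonging to a different CUDA version, would be the first thing to look at.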

Current Config

CUDA 10.0 installed
Nvidia driver 436.48 (Game Ready driver)

Potential issues & resolution actions so far

Obviously none of them have fixed things

  1. Old CUDA installations (9.0, 9.1, 10.0, 10.1): all except 10.0 uninstalled and PC rebooted; 10.0 installer then run again
  2. Updating cuDNN files: tried first with the originals, then with the cuDNN v7.6.3.30 files copied to bin, include and lib as appropriate
  3. Switched from the Game Ready driver to the "Studio" driver
  4. Checked all environment variables and removed everything that referred to a CUDA version other than 10.0
  5. (Update) Renamed nvcuda.dll to .old in system32 and reran the CUDA 10.0 installer... a new nvcuda.dll was not produced.
  6. (Update 2) Found nvcuda64.dll v10.0.132 in the driver store and replaced nvcuda.dll in system32 with it; after reboot, nvidia-smi now reports no CUDA version at all :(
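The environment variable check in step 4 can be scripted so that nothing is missed by eye. This is a rough Python sketch, not a definitive tool: `stale_cuda_vars` is a hypothetical helper, and it assumes the usual Windows layout where the toolkit lives under a `...\CUDA\vX.Y` directory and the installer creates variables named like `CUDA_PATH_V10_0`.

```python
import os
import re

# Assumed conventions: install directories end in CUDA\vX.Y and the
# installer adds per-version variables such as CUDA_PATH_V10_0.
_IN_VALUE = re.compile(r"CUDA[\\/]v(\d+\.\d+)", re.IGNORECASE)
_IN_NAME = re.compile(r"^CUDA_PATH_V(\d+)_(\d+)$", re.IGNORECASE)


def stale_cuda_vars(env, wanted="10.0"):
    """Return (name, value) pairs that mention a CUDA version other than wanted."""
    stale = []
    for name, value in sorted(env.items()):
        versions = set(_IN_VALUE.findall(value))
        match = _IN_NAME.match(name)
        if match:
            versions.add(match.group(1) + "." + match.group(2))
        if versions - {wanted}:
            stale.append((name, value))
    return stale


if __name__ == "__main__":
    for name, value in stale_cuda_vars(dict(os.environ)):
        print(name, "=", value)
```

Run from the same environment that launches Jupyter, since user and system variables can differ between shells.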

Known Oddities

  1. [superseded by Update 2] nvidia-smi.exe reports CUDA 10.1 (yes, it is available on my Win 10), but checking through the registry I can't find anything to suggest CUDA 10.1 is lingering there... (Update) Found it in C:\Windows\System32

  2. Despite the uninstalls, I still have CudaXYZWizardsPackage in the registry under the key Computer\HKEY_USERS\.DEFAULT\Software\Microsoft\VisualStudio\14.0_Config\InstalledProducts with XYZ = 90, 91, 100, 101, but I doubt this is the issue for TF in Python ;) (Update) There is nothing in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\Extensions\NVIDIA except for 10.0, so these are just orphaned registry entries.
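On oddity 1: as far as I understand it, the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, not the version of the installed toolkit, so seeing 10.1 alongside a 10.0 toolkit may not be a problem in itself. Here is a minimal Python sketch for pulling both numbers out of the header. `parse_smi_header` is a hypothetical helper, and the header layout is an assumption based on what nvidia-smi printed for this driver generation.

```python
import re


def parse_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi output.

    A field that is missing or not numeric (for example "N/A") becomes None.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


# Possible usage on a machine where nvidia-smi is on PATH:
#   import subprocess
#   out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
#   print(parse_smi_header(out))
```

A `None` for the CUDA field would correspond to the state after Update 2, where nvidia-smi reported no CUDA version at all.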

Other info

  1. Before doing all the uninstalls etc. I did successfully build and run the Nvidia sample clock project in VS 2017 so the basics were OK (then)

Questions

  1. How can I completely remove all trace of CUDA to start again from a clean slate?
  2. How could I diagnose such issues in future, to work out where the problem is and what to do?
  3. Can this particular issue be resolved more simply?
  4. (New) Where can I get nvcuda.dll 10.0 to replace the one in system32? (Answer) One possibility is C:\Windows\System32\DriverStore\FileRepository

Upvotes: 4

Views: 13295

Answers (2)

Julian Moore

Reputation: 1028

This is mostly an extended comment, since @diego asked for updates...

I now have CUDA 10.0 installed and the nVidia control panel reports nvcuda.dll as v 10.0.132

I have built the recommended demo deviceQuery.exe using Visual Studio 2017 from the VS solution in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.0\1_Utilities\deviceQuery (note that the .exe ends up in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.0\bin\win64\Debug)

The program then ran from a cmd prompt and gave the following output.

devicequery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11264 MBytes (11811160064 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1607 MHz (1.61 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
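The closing summary of deviceQuery is the part worth checking automatically. Here is a small Python sketch that extracts the versions and the result from it. Both function names are hypothetical, and the "healthy" rule (run passed, driver version at least the runtime version) is my reading of how driver and runtime compatibility works, not something deviceQuery itself reports.

```python
import re

# Only "key = value" pairs from the closing summary are matched.
_FIELD = re.compile(
    r"(CUDA Driver Version|CUDA Runtime Version|NumDevs|Result) = ([^,\s]+)"
)


def parse_device_query_summary(text):
    """Collect the key = value pairs from deviceQuery's closing summary."""
    return dict(_FIELD.findall(text))


def _version(value):
    """Turn a string like '10.0' into (10, 0) so versions compare numerically."""
    return tuple(int(part) for part in value.split("."))


def looks_healthy(summary):
    """True when the run passed and the driver is at least as new as the runtime."""
    driver = summary.get("CUDA Driver Version")
    runtime = summary.get("CUDA Runtime Version")
    if summary.get("Result") != "PASS" or not driver or not runtime:
        return False
    return _version(driver) >= _version(runtime)
```

Feeding it the captured output of deviceQuery.exe (for example via `subprocess.run`) gives a quick yes/no before moving on to TensorFlow.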

What did I do to achieve this? Hard to be specific because I didn't realise I had succeeded, but I recall setting the display driver to VGA, rebooting (twice for safety), then uninstalling CUDA 10.0, rebooting, and then installing 10.0 again.

I did just notice that I built deviceQuery from a VS 2012 solution, but I did agree to VS upgrading it when the solution was opened.

Upvotes: 1

Alexey Golyshev

Reputation: 812

  1. download and install Anaconda (Python 3.7): https://www.anaconda.com/distribution/

  2. in Command Prompt:

conda update conda
conda update python

conda create --name tensorflow-gpu
conda activate tensorflow-gpu
conda install pip jupyter
pip install tensorflow-gpu
conda install cudatoolkit=10.0 -c pytorch
  3. in Start menu select Anaconda3 (64-bit) -> Jupyter Notebook (tensorflow-gpu), then run each of the following in its own cell:

import tensorflow as tf

%%time
with tf.device('/CPU:0'):
    a = tf.random.uniform([1000,1000])
    b = tf.random.uniform([1000,1000])
    c = tf.matmul(a, b)

Wall time: 18.9 ms

%%time
with tf.device('/GPU:0'):
    a = tf.random.uniform([1000,1000])
    b = tf.random.uniform([1000,1000])
    c = tf.matmul(a, b)

Wall time: 2.99 ms
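A single %%time run can include one-off start-up cost, so the two numbers above are only a rough comparison. Below is a minimal sketch of a repeatable timing helper in plain Python. `best_time_ms` is a hypothetical name, and the choice of one warm-up call and ten repeats is arbitrary.

```python
import time


def best_time_ms(fn, repeats=10, warmup=1):
    """Call fn a few times and return the fastest wall time in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best
```

With the tensors from the cells above it could be used as `best_time_ms(lambda: tf.matmul(a, b).numpy())`. The `.numpy()` call is there on the assumption that it forces the result back to the host, so the GPU work has actually finished before the timer stops.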

Upvotes: 2
