KPW

Reputation: 37

Nvidia GPU error while using tensorflow

TensorFlow session creation fails on a GPU node with the error below:

2018-06-19 07:01:08.400165: E tensorflow/core/common_runtime/direct_session.cc:154] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE

Below is the GPU info from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000752C:00:00.0 Off |                    2 |
| N/A   39C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Please share some pointers to debug this further.

PS: The same program runs fine on a CPU node.

Upvotes: 1

Views: 2629

Answers (1)

Sumsuddin Shojib

Reputation: 3743

Based on another Stack Overflow discussion, I think your GPU memory has corrupted bits that ECC (error-correcting code) could not correct.

According to that discussion, restarting the computer may help.

Also, in your nvidia-smi output the Volatile Uncorr. ECC column shows 2. On a healthy card with ECC enabled it should read 0 (it reads N/A only when ECC is disabled or not supported). So my suggestion is that you restart the computer and confirm that this counter is back to 0 (or N/A) before running your program. That way you can rule out your program as the cause of the issue.
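If you want to automate that check, here is a minimal sketch that reads the volatile uncorrectable ECC counter through nvidia-smi before the TensorFlow session is created. The helper names (parse_ecc_counts, gpus_with_uncorrectable_errors, query_ecc_counts) are my own and not part of any library. I am also assuming that your driver's nvidia-smi accepts the ecc.errors.uncorrected.volatile.total query field; you can confirm that with nvidia-smi --help-query-gpu.

```python
import subprocess

# Assumed query field for the "Volatile Uncorr. ECC" column;
# verify it on your driver with `nvidia-smi --help-query-gpu`.
ECC_QUERY = "ecc.errors.uncorrected.volatile.total"


def parse_ecc_counts(csv_text):
    """Turn nvidia-smi CSV output (one value per GPU) into a list.

    Numeric values become ints; anything else (for example "[N/A]"
    when ECC is disabled or unsupported) becomes None.
    """
    counts = []
    for line in csv_text.splitlines():
        value = line.strip()
        if not value:
            continue
        counts.append(int(value) if value.isdigit() else None)
    return counts


def gpus_with_uncorrectable_errors(counts):
    """Return the indices of GPUs whose counter is greater than zero."""
    return [i for i, c in enumerate(counts) if c is not None and c > 0]


def query_ecc_counts():
    """Run nvidia-smi and return the per-GPU counters (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={ECC_QUERY}", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ecc_counts(result.stdout)
```

Before creating the session you could call gpus_with_uncorrectable_errors(query_ecc_counts()) and stop early if the returned list is not empty. With the output you posted, GPU 0 would be reported.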

Upvotes: 2
