Reputation: 37
TensorFlow session creation fails on a GPU node with the error below:
2018-06-19 07:01:08.400165: E tensorflow/core/common_runtime/direct_session.cc:154] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_ECC_UNCORRECTABLE
Below is the GPU info
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000752C:00:00.0 Off |                    2 |
| N/A   39C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Please share some pointers on how to debug this further.
PS: The same program runs fine on a CPU node.
Upvotes: 1
Views: 2629
Reputation: 3743
Based on this other Stack Overflow discussion, I think your GPU memory has corrupted bits that ECC (error-correcting code) could not correct, which is what CUDA_ERROR_ECC_UNCORRECTABLE is reporting.
According to that discussion, restarting the computer may help.
Also, look at the Volatile Uncorr. ECC column in your GPU info: on a healthy GPU with ECC enabled it should read 0 (or N/A when ECC is disabled or unsupported), but in your case it shows 2. Volatile counters are reset when the driver is reloaded, so my suggestion is to restart the computer and confirm that this column no longer shows any errors before running your program. That way you can be sure your program is not what is causing the issue.
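If you want to script that check before each run, here is a rough sketch rather than a definitive tool. It assumes `nvidia-smi` is on the PATH and that your driver supports the `ecc.errors.uncorrected.volatile.total` query field (check `nvidia-smi --help-query-gpu`). The helper names are my own, not part of any library.

```python
import subprocess

# Query field names as listed by `nvidia-smi --help-query-gpu`.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_report(csv_text):
    """Map GPU index -> uncorrected volatile ECC count.

    The count is None when nvidia-smi does not report a number
    (for example "[N/A]" on GPUs without ECC support).
    """
    counts = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        index, value = fields[0], fields[-1]
        counts[int(index)] = int(value) if value.isdigit() else None
    return counts


def gpus_with_uncorrected_errors(csv_text):
    """Return the indices of GPUs that report a non-zero error count."""
    return [idx for idx, n in parse_ecc_report(csv_text).items() if n]


def query_nvidia_smi():
    """Run nvidia-smi and return its CSV output (assumes it is on PATH)."""
    return subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        text=True,
    )
```

On the GPU node you would call `gpus_with_uncorrected_errors(query_nvidia_smi())` and only start the TensorFlow session if the returned list is empty. `nvidia-smi -q -d ECC` prints the same counters in more detail if you want to inspect them by hand.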
Upvotes: 2