ndronen

Reputation: 1012

NotFoundError running TensorFlow XLA example (libdevice.compute_35.10.bc)

I'm running the XLA tutorial example with TensorFlow built from source. Running python mnist_softmax_xla.py fails with the following error:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
W tensorflow/core/framework/op_kernel.cc:993] Not found: ./libdevice.compute_35.10.bc not found
Traceback (most recent call last):
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: ./libdevice.compute_35.10.bc not found
         [[Node: cluster_0/_0/_1 = _XlaLaunch[Targs=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], Tconstants=[DT_INT32], Tresults=[DT_FLOAT, DT_FLOAT], function=cluster_0[_XlaCompiledKernel=true, _XlaNumConstantArgs=1], _device="/job:localhost/replica:0/task:0/gpu:0"](Shape_2, _recv_Placeholder_0/_3, _recv_Placeholder_1_0/_1, Variable_1, Variable)]]

I have CUDA 8 installed with cuDNN 5.1. The file libdevice.compute_35.10.bc does exist on the machine:

$ find /usr/local/cuda/ -type f | grep libdevice.compute_35.10.bc
/usr/local/cuda/nvvm/libdevice/libdevice.compute_35.10.bc

My hunch is that this has something to do with the log message "TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.", but I'm not sure what to do about it.

Upvotes: 1

Views: 4170

Answers (2)

Jingyue Wu

Reputation: 191

https://github.com/tensorflow/tensorflow/pull/7079 should be able to fix this. Thanks for the bug report!

Upvotes: 1

Justin L.

Reputation: 4147

The key is this log message:

I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.

(I found that file by digging around in the sources first, and only afterwards noticed the message in the logs.)

For reasons that I do not currently understand, XLA does not look in /usr/local/cuda (or whatever directory you gave when you ran ./configure) for libdevice. Per cuda_libdevice_path.cc [1], it's looking for a symlink that was created specifically to point it to libdevice.

I'm going to loop in the person who wrote this code to figure out what it's supposed to be doing. In the meantime, I was able to work around it myself as follows:

$ mkdir local_config_cuda
$ ln -s /usr/local/cuda local_config_cuda/cuda
$ TEST_SRCDIR=$(pwd) python my_program.py

The important thing is to set TEST_SRCDIR to the parent of the local_config_cuda directory.
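If you'd rather set this up from Python than from the shell, here is a minimal sketch of the same workaround. The helper names make_cuda_runfiles_link and find_libdevice are made up for this example and are not part of TensorFlow. The nvvm/libdevice subdirectory is taken from the find output in the question; whether TensorFlow looks at exactly that relative path under the CUDA root is an assumption on my part.

```python
import os
from pathlib import Path


def make_cuda_runfiles_link(cuda_root, parent_dir):
    """Create parent_dir/local_config_cuda/cuda as a symlink to cuda_root.

    Returns the resolved parent_dir, which is the value TEST_SRCDIR should
    get (the parent of local_config_cuda). Hypothetical helper, not a
    TensorFlow API.
    """
    parent = Path(parent_dir).resolve()
    link = parent / "local_config_cuda" / "cuda"
    link.parent.mkdir(parents=True, exist_ok=True)
    # Replace a stale symlink so the function can be re-run safely.
    # A real directory at this path is left alone and symlink_to will raise.
    if link.is_symlink():
        link.unlink()
    link.symlink_to(Path(cuda_root).resolve(), target_is_directory=True)
    return str(parent)


def find_libdevice(test_srcdir, name="libdevice.compute_35.10.bc"):
    """Return the libdevice path reachable through the symlink, or None.

    Assumes the nvvm/libdevice layout seen under /usr/local/cuda in the
    question.
    """
    candidate = (Path(test_srcdir) / "local_config_cuda" / "cuda"
                 / "nvvm" / "libdevice" / name)
    return candidate if candidate.is_file() else None


# Example use, before the first session.run() that triggers XLA:
#   os.environ["TEST_SRCDIR"] = make_cuda_runfiles_link("/usr/local/cuda", os.getcwd())
#   print(find_libdevice(os.environ["TEST_SRCDIR"]))
```

I have only tested the shell version above; setting the variable on the command line, as in TEST_SRCDIR=$(pwd) python my_program.py, is the safer choice if the in-process assignment doesn't seem to take effect.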

Sorry for the trouble, and sorry I don't have a less-hacky answer for you right now.

[1] https://github.com/tensorflow/tensorflow/blob/e1f44d8/tensorflow/core/platform/cuda_libdevice_path.cc#L23
    https://github.com/tensorflow/tensorflow/blob/1084748efa3234c7daa824718aeb7df7b9252def/tensorflow/core/platform/default/cuda_libdevice_path.cc#L27

Upvotes: 2
