tensorflow places softmax op on cpu instead of gpu

Question

I have a tensorflow model with multiple inputs and several layers, and a final softmax layer. The model is trained in Python (using the Keras framework), then saved and inference is done using a C++ program that facilitates a CMake build of TensorFlow (following basically those instructions: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake).

In python (tensorflow-gpu) all ops use the GPU (using log_device_placement):

out/MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.005837: I C:	f_jenkins\home\workspace
el-in\M\windows-gpu\PY\35	ensorflow\core\common_runtime\simple_placer.cc:872] out/MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006201: I C:	f_jenkins\home\workspace
el-win\M\windows-gpu\PY\35	ensorflow\core\common_runtime\simple_placer.cc:872] 
out/BiasAdd: (BiasAdd)/job:localhost/replica:0/task:0/gpu:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006535: I C:	f_jenkins\home\workspace
el-win\M\windows-gpu\PY\35	ensorflow\core\common_runtime\simple_placer.cc:872] out/Softmax: (Softmax)/job:localhost/replica:0/task:0/gpu:0

To save the graph, the freeze_graph script is used (the script producing the log above loads again the freezed graph in .pb format).

When I use the C++ program and load the freezed graph (following closely the LoadGraph() function in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/main.cc - ReadBinaryProto() and session->Create()), and log again the device placements, I find that the Softmax is placed on CPU (all others ops are on GPU):

dense_6/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0

This placement is also confirmed by high CPU/low GPU utilization, and also apparent from profiling the application. The data type of the out layer is float32 (out/Softmax -> (,)).

Further investigation revealed:

Creating the softmax-op in C++ and placing it on GPU explicitly throws this error message:

Cannot assign a device for operation 'tsoftmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

A call to tensorflow::LogAllRegisteredKernels() showed also that Softmax is only available for CPU!
The build directory contains many files related to "softmax" (e.g. `tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.obj.Release.cmake). Don't know how to check every compilation step, though.
when I look into the "tf_core_gpu_kernels.lib" (one can open a .lib with 7Z ;)), there are files like "tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.lib" - so I believe there is nothing wrong with compiling the kernels itself
But: inspecting the "tensorflow.dll" (Dependency Walker) shows that only CPU kernels for Softmax are included (there are functions like const tensorflow::SoftmaxOp, but no functions with GPU such as const tensorflow::SoftplusGradOp).

Setup: Tensorflow 1.3.0, Windows 10, GPU: NVidia GTX 1070 (8GB RAM, memory utilization also very low).

tensorflow places softmax op on cpu instead of gpu

Answers (1)

Related Questions