Reputation: 21
I have a TensorFlow model with multiple inputs, several layers, and a final softmax layer. The model is trained in Python (using Keras) and then saved; inference is done in a C++ program that uses a CMake build of TensorFlow (essentially following these instructions: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/cmake).
In Python (tensorflow-gpu), all ops are placed on the GPU, as shown by log_device_placement:
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.005837: I C:\tf_jenkins\home\workspace\rel-in\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006201: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/BiasAdd: (BiasAdd)/job:localhost/replica:0/task:0/gpu:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/gpu:0
2017-12-04 14:07:38.006535: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\simple_placer.cc:872] out/Softmax: (Softmax)/job:localhost/replica:0/task:0/gpu:0
To save the graph, the freeze_graph script is used (the script that produced the log above reloads the frozen graph in .pb format).
When I load the frozen graph in the C++ program (closely following the LoadGraph() function in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/main.cc, i.e. ReadBinaryProto() and session->Create()) and log the device placements again, I find that Softmax is placed on the CPU (all other ops are on the GPU):
dense_6/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
dense_6/Relu: (Relu): /job:localhost/replica:0/task:0/device:GPU:0
out/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
out/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0
out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0
This placement is also confirmed by high CPU and low GPU utilization, and is apparent from profiling the application. The data type of the out layer is float32 (out/Softmax -> (<tf.Tensor 'out/Softmax:0' shape=(?, 1418) dtype=float32>,)).
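The placement log above can also be checked automatically instead of by eye. The following is a minimal sketch, assuming the log has been captured as plain text in the format shown in this question; the function name ops_not_on_gpu and the regular expression are illustrative and not part of TensorFlow.

```python
import re

# Matches placement lines such as
#   out/Softmax: (Softmax): /job:localhost/replica:0/task:0/device:CPU:0
# as well as the older form ending in /gpu:0 or /cpu:0.
# Timestamped log lines do not match and are skipped.
PLACEMENT_RE = re.compile(
    r"^(?P<node>[^\s:]+): \((?P<op>\w+)\): "
    r"\S*?/(?:device:)?(?P<device>[A-Za-z]+):(?P<index>\d+)\s*$"
)


def ops_not_on_gpu(log_text):
    """Return (node, op, device) tuples for every op placed off the GPU."""
    found = []
    for line in log_text.splitlines():
        match = PLACEMENT_RE.match(line.strip())
        if match is None:
            continue  # not a placement line
        if match.group("device").lower() != "gpu":
            device = match.group("device").upper() + ":" + match.group("index")
            found.append((match.group("node"), match.group("op"), device))
    return found
```

Feeding it the C++ log from this question should flag only out/Softmax, which makes it easy to spot regressions after rebuilding the library.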
Further investigation revealed:
Cannot assign a device for operation 'tsoftmax': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
A call to tensorflow::LogAllRegisteredKernels() also showed that Softmax is only available for CPU!
The build directory contains many files related to "softmax" (e.g. tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.obj.Release.cmake). I don't know how to check every compilation step, though.
When I look into "tf_core_gpu_kernels.lib" (one can open a .lib with 7-Zip ;)), there are files like "tf_core_gpu_kernels_generated_softmax_op_gpu.cu.cc.lib", so I believe there is nothing wrong with compiling the kernels themselves.
But: inspecting "tensorflow.dll" (with Dependency Walker) shows that only CPU kernels for Softmax are included (there are functions like const tensorflow::SoftmaxOp<struct Eigen::ThreadPoolDevice,double>, but no Softmax functions for the GPU, whereas other ops do have them, e.g. const tensorflow::SoftplusGradOp<struct Eigen::GpuDevice,float>).
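The manual inspection of the DLL can be turned into a quick check. This is a sketch under a loud assumption: the undecorated symbol names have already been exported to a list of text lines by some tool (how that is done is not covered here), and they look like the two examples quoted above. The helper devices_by_op is hypothetical.

```python
import re
from collections import defaultdict

# Picks the op class and Eigen device out of undecorated names like
#   const tensorflow::SoftmaxOp<struct Eigen::ThreadPoolDevice,double>
KERNEL_RE = re.compile(
    r"tensorflow::(?P<op>\w+)<struct Eigen::(?P<device>\w+),"
)


def devices_by_op(symbol_lines):
    """Map each op class name to the set of Eigen devices it appears with."""
    devices = defaultdict(set)
    for line in symbol_lines:
        for match in KERNEL_RE.finditer(line):
            devices[match.group("op")].add(match.group("device"))
    return dict(devices)
```

An op whose set contains ThreadPoolDevice but not GpuDevice is a candidate for the kind of missing GPU kernel described here.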
Setup: TensorFlow 1.3.0, Windows 10, GPU: NVIDIA GTX 1070 (8 GB RAM; memory utilization is also very low).
Upvotes: 2
Views: 1038
Reputation: 21
I found a workaround: include tf_core_gpu_kernels.lib in some of the build steps (create_def_file.py). More details here: GitHub Issue 15254
Upvotes: 0