Reputation: 379
I am trying to create a new TensorFlow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA, you always have to allocate memory on the device and then use device pointers inside the kernels.
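For reference, this is the plain-CUDA workflow I have in mind (a minimal, self-contained sketch of my own, not taken from the TensorFlow example; all names here are made up):

// Standalone CUDA: explicit device allocation and copies (illustrative sketch).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void DoubleKernel(const int size, const float* in, float* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) out[i] = 2.0f * in[i];
}

int main() {
  const int size = 1024;
  float host_in[size], host_out[size];
  for (int i = 0; i < size; ++i) host_in[i] = static_cast<float>(i);

  // Allocate device memory and copy the input from host to device.
  float* dev_in = nullptr;
  float* dev_out = nullptr;
  cudaMalloc(&dev_in, size * sizeof(float));
  cudaMalloc(&dev_out, size * sizeof(float));
  cudaMemcpy(dev_in, host_in, size * sizeof(float), cudaMemcpyHostToDevice);

  // The kernel only ever sees device pointers.
  DoubleKernel<<<(size + 255) / 256, 256>>>(size, dev_in, dev_out);

  // Copy the result back to the host.
  cudaMemcpy(host_out, dev_out, size * sizeof(float), cudaMemcpyDeviceToHost);
  printf("out[3] = %f\n", host_out[3]);

  cudaFree(dev_in);
  cudaFree(dev_out);
  return 0;
}

So I expected the op to need explicit cudaMalloc/cudaMemcpy calls somewhere.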
What am I missing? I checked that input_tensor.flat<T>().data() returns a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"

using namespace tensorflow;

using GPUDevice = Eigen::GpuDevice;

// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);
  }
}

// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  //
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}

// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;

#endif  // GOOGLE_CUDA
Upvotes: 0
Views: 785
Reputation: 420
When you look at these code lines on https://www.tensorflow.org/extend/adding_an_op you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);

  // Create an output tensor
  Tensor* output_tensor = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                   &output_tensor));

  // Do the computation.
  OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in tensor"));
  ExampleFunctor<Device, T>()(
      context->eigen_device<Device>(),
      static_cast<int>(input_tensor.NumElements()),
      input_tensor.flat<T>().data(),
      output_tensor->flat<T>().data());
}
In context->allocate_output(...) they hand over the address of the output tensor pointer, and the output Tensor is then allocated. The context knows whether it is running on GPU or CPU and allocates the tensor accordingly, either on the host or on the device. The pointer handed to the CUDA kernel then simply points to the actual data inside the Tensor class.
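To make that concrete, here is a rough sketch (reconstructed from the same tutorial, so exact names such as ExampleOp and the registration details may differ) of how the CPU specialization and the kernel registration fit together; the Device template parameter and the DEVICE_CPU/DEVICE_GPU registration determine whether allocate_output hands you host or device memory:

// kernel_example.cc (sketch, not verbatim from the tutorial)
typedef Eigen::ThreadPoolDevice CPUDevice;
typedef Eigen::GpuDevice GPUDevice;

// CPU specialization of the functor: same signature as the GPU version,
// but `in` and `out` point to host memory allocated by the context.
template <typename T>
struct ExampleFunctor<CPUDevice, T> {
  void operator()(const CPUDevice& d, int size, const T* in, T* out) {
    for (int i = 0; i < size; ++i) {
      out[i] = 2 * in[i];
    }
  }
};

// The registration tells TensorFlow which device the kernel runs on,
// and hence which allocator the OpKernelContext uses for its tensors:
// DEVICE_CPU -> host memory, DEVICE_GPU -> device memory.
REGISTER_KERNEL_BUILDER(
    Name("Example").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    ExampleOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(
    Name("Example").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    ExampleOp<GPUDevice, float>);

So the same Compute code works unchanged for both registrations: in the GPU-registered kernel, input_tensor.flat<T>().data() already yields a device pointer, which is why no cudaMalloc or cudaMemcpy appears in the op.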
Upvotes: 2