jxj

Reputation: 145

Is there a device-side pointer to host memory for kernel use in OpenCL (as in CUDA)?

In CUDA, the kernel can manage data transfer from host memory into device shared memory itself, via a device-side pointer to host memory. Like this:

int *a, *b, *c;                 // host pointers
int *dev_a, *dev_b, *dev_c;     // device pointers to host memory

…

cudaHostGetDevicePointer(&dev_a, a, 0); // no memcpy to the device needed now; pointers suffice
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c, c, 0);

…

// kernel launch
add<<<B, T>>>(dev_a, dev_b, dev_c);
// dev_a, dev_b and dev_c are passed to the kernel, which uses them to access host memory directly.

In the example above, the kernel can access host memory through dev_a, dev_b and dev_c, and can use these pointers to move data from the host into shared memory directly, without staging it through global memory.
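For completeness, a minimal sketch of what the elided setup might look like, and of a kernel that stages mapped host memory into shared memory. The flag values and the 256-element size are illustrative assumptions, not from the question:

```cuda
// Sketch: zero-copy setup with mapped pinned memory (assumes the device
// supports cudaDeviceMapHost; error checking omitted).
__global__ void add(const int *a, const int *b, int *c) {
    __shared__ int sa[256], sb[256];           // staged in shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sa[threadIdx.x] = a[i];                    // read host memory via mapped ptr
    sb[threadIdx.x] = b[i];
    __syncthreads();
    c[i] = sa[threadIdx.x] + sb[threadIdx.x];  // written back to host memory
}

int main(void) {
    int *a, *b, *c;                 // host pointers
    int *dev_a, *dev_b, *dev_c;     // device pointers to the same host memory
    size_t bytes = 256 * sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);                    // enable mapping
    cudaHostAlloc((void **)&a, bytes, cudaHostAllocMapped);   // pinned + mapped
    cudaHostAlloc((void **)&b, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&c, bytes, cudaHostAllocMapped);

    cudaHostGetDevicePointer((void **)&dev_a, a, 0);
    cudaHostGetDevicePointer((void **)&dev_b, b, 0);
    cudaHostGetDevicePointer((void **)&dev_c, c, 0);

    add<<<1, 256>>>(dev_a, dev_b, dev_c);
    cudaDeviceSynchronize();        // results are now visible through c

    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    return 0;
}
```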

But this seems to be impossible in OpenCL? (Local memory in OpenCL is the counterpart of shared memory in CUDA.)

Upvotes: 3

Views: 2120

Answers (1)

sharpneli

Reputation: 1621

You can find an exactly identical API in OpenCL.

How it works on CUDA:

According to this presentation and the official documentation, the money quote about cudaHostGetDevicePointer is:

Passes back device pointer of mapped host memory allocated by cudaHostAlloc or registered by cudaHostRegister.

CUDA's cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with clEnqueueMapBuffer in OpenCL. Basically, if it's a discrete GPU the results are cached on the device, and if it's an integrated GPU sharing memory with the host the memory is used directly. So there is no actual 'zero copy' operation with a discrete GPU in CUDA either.
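A hedged sketch of that OpenCL pattern (ctx, queue and kernel are assumed to have been created already, the 256-element size is illustrative, and error checking is omitted):

```c
/* Sketch: zero-copy-style buffer via CL_MEM_ALLOC_HOST_PTR + map/unmap. */
cl_int err;
size_t bytes = 256 * sizeof(int);

/* Let the implementation allocate host-accessible memory. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                            bytes, NULL, &err);

/* Map it to get a host pointer; fill the data through that pointer. */
int *host = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
for (int i = 0; i < 256; ++i)
    host[i] = i;

/* Unmap before the kernel touches the buffer. */
clEnqueueUnmapMemObject(queue, buf, host, 0, NULL, NULL);

/* From the kernel's side this is just an ordinary __global argument. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
```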

The function cudaHostGetDevicePointer does not take raw malloc'ed pointers, which is exactly the same limitation as in OpenCL. From the API user's point of view the two are identical approaches, allowing the implementations to perform essentially the same optimizations.

With a discrete GPU, the pointer you get points to an area the GPU can DMA into directly. Otherwise the driver would take your pointer, copy the data to the DMA area, and then initiate the transfer.

However, in OpenCL 2.0 this is explicitly possible, depending on the capabilities of your device. With the finest-grained sharing you can use arbitrarily malloc'ed host pointers, and even share atomics with the host, so you could dynamically control the kernel from the host while it is running.
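A minimal sketch of that OpenCL 2.0 shared-virtual-memory path (assuming ctx, queue and kernel exist and the device reports fine-grained buffer SVM support; error checking omitted):

```c
/* Sketch: fine-grained SVM allocation shared directly with the kernel. */
int *p = (int *)clSVMAlloc(ctx,
                           CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                           256 * sizeof(int), 0);
for (int i = 0; i < 256; ++i)
    p[i] = i;                       /* host writes directly, no map/unmap */

/* SVM pointers are passed with their own setter, not as cl_mem objects. */
clSetKernelArgSVMPointer(kernel, 0, p);

/* ... enqueue the kernel; with fine-grained sharing plus SVM atomics the
 * host can observe and even steer the kernel while it runs ... */

clSVMFree(ctx, p);
```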

http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

See page 162 for the shared virtual memory spec. Note that even these are still just __global pointers from the kernel's point of view.
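To illustrate that last point, a sketch of an OpenCL C kernel matching the question's CUDA example: the buffers arrive as plain __global pointers regardless of how they were allocated, and staging into __local memory (CUDA's "shared") is explicit:

```c
/* Sketch: stage __global data into __local memory, the OpenCL counterpart
 * of the CUDA shared-memory pattern in the question. */
__kernel void add(__global const int *a,
                  __global const int *b,
                  __global int *c,
                  __local int *la,
                  __local int *lb)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    la[lid] = a[gid];               /* global (possibly host) -> local */
    lb[lid] = b[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    c[gid] = la[lid] + lb[lid];
}
```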

Upvotes: 3
