jiandingzhe

Reputation: 2121

Completely confused about how to do OpenCL data transfer

I'm learning OpenCL and attempting to use it in a low-latency scenario, so I'm really concerned about memory-transfer delay.

According to NVIDIA's OpenCL Best Practices Guide, and many other sources, direct reads/writes on buffer objects should be avoided in favor of the map/unmap mechanism. In that guide, demonstration code like this is given:

// Pinned host-side staging buffer plus the actual device buffer.
cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);

// Map the pinned buffer to get a host pointer into pinned memory.
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);

// Fill the pinned memory on the host.
for(unsigned int i = 0; i < memSize; i++)
{
    cDataIn[i] = (unsigned char)(i & 0xff);
}

// Copy from pinned host memory to the device buffer.
// (The original snippet passed szBuffBytes here; in the guide that is
// the same size as memSize, so memSize is used for consistency.)
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, memSize, cDataIn, 0, NULL, NULL);

In this code snippet, two buffer objects are created explicitly, and a write-to-device operation is also issued explicitly.

If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR, the buffer object's storage is created on the host side, probably in pinned (DMA-accessible) memory, and no storage is allocated on the device side. So the above code actually creates two separate storages. If so:

Upvotes: 1

Views: 727

Answers (1)

Dithermaster

Reputation: 6333

Unfortunately, it's different for each vendor.

NVIDIA claims their best bandwidth comes from using read/write buffer operations where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think your example is doing exactly that). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.
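For comparison, here's a minimal sketch of that map/unmap-only path. `ctx`, `queue`, and `memSize` are placeholder names (not from the guide), and error handling is omitted:

// Create only the device buffer; no host-side staging buffer at all.
cl_mem devBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, memSize, NULL, NULL);

// Map: the driver hands back a host-accessible pointer (it may stage
// through pinned memory internally).
unsigned char* p = (unsigned char*) clEnqueueMapBuffer(
    queue, devBuf, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);

for (size_t i = 0; i < memSize; i++)
    p[i] = (unsigned char)(i & 0xff);

// Unmap: this is where any actual host-to-device copy happens.
clEnqueueUnmapMemObject(queue, devBuf, p, 0, NULL, NULL);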

With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags that can make certain scenarios faster; you should study them, but more importantly, build benchmarks that try out everything to see what actually works best for your task.
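As a starting point for such a benchmark, here's a sketch that times one transfer path using OpenCL event profiling. It assumes the usual <CL/cl.h> and <stdio.h> includes, that `queue` was created with CL_QUEUE_PROFILING_ENABLE, and that `devBuf`, `hostData`, and `memSize` are placeholders you'd fill in; error checks are omitted:

cl_event ev;
clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, memSize, hostData,
                     0, NULL, &ev);
clWaitForEvents(1, &ev);

// Device timestamps, in nanoseconds.
cl_ulong t0, t1;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);

// bytes / nanoseconds == GB/s
printf("write: %.3f ms, %.2f GB/s\n",
       (t1 - t0) * 1e-6, (double)memSize / (double)(t1 - t0));
clReleaseEvent(ev);

Repeat the same timing around map/unmap and around the pinned-staging path to see which wins on your hardware.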

With discrete GPUs from both vendors, you should use a separate command queue for transfer operations so they can overlap with other (non-dependent) compute operations (look up the various compute-overlap examples). Furthermore, some higher-end discrete GPUs have dual DMA engines and can be downloading one buffer at the same time they are uploading another, so you could be uploading one batch of work while computing a second and downloading the results of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.
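Here's a rough sketch of the two-queue pattern with event synchronization. Everything here (`ctx`, `dev`, `kernel`, `bufA`/`bufB`, `hostA`/`hostB`, `memSize`) is a placeholder, and error checks are omitted; it only shows the ordering idea, not a complete pipeline:

// One queue for transfers, one for compute, so they can overlap.
cl_command_queue xferQ = clCreateCommandQueue(ctx, dev, 0, NULL);
cl_command_queue computeQ = clCreateCommandQueue(ctx, dev, 0, NULL);

cl_event upA, upB;
size_t globalSize = memSize; // one work-item per byte, just for illustration

// Queue batch A's upload, then batch B's right behind it.
clEnqueueWriteBuffer(xferQ, bufA, CL_FALSE, 0, memSize, hostA, 0, NULL, &upA);
clEnqueueWriteBuffer(xferQ, bufB, CL_FALSE, 0, memSize, hostB, 0, NULL, &upB);

// Compute on A as soon as its upload is done; B's upload can overlap.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clEnqueueNDRangeKernel(computeQ, kernel, 1, NULL, &globalSize, NULL,
                       1, &upA, NULL);

clFinish(computeQ);
clFinish(xferQ);
clReleaseEvent(upA);
clReleaseEvent(upB);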

With AMD's APUs and Intel's integrated graphics, map/unmap of the "device" buffer is "free" since it lives in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.
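To illustrate, here's a sketch of that zero-copy pattern; `ctx`, `queue`, `kernel`, and `memSize` are placeholder names, not from the original post:

// Allocate in host memory; on an APU / integrated GPU this IS the
// device-visible memory, so map/unmap does no copying.
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            memSize, NULL, NULL);

unsigned char* p = (unsigned char*) clEnqueueMapBuffer(
    queue, buf, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
// ... fill p on the host ...
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

// The kernel reads the same physical pages; no clEnqueueWriteBuffer.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);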

Upvotes: 3
