Reputation: 161
I am trying to use OpenCL to accelerate certain segments of a pre-existing C++ simulation. Currently, I have selected a loop that runs for 1k-1M iterations on every simulation time-step.
To my current understanding, I have to manually write the data to the kernel, using enqueueWriteBuffer to fill the kernel buffers, before calling the kernel. I have to do this every time-step so that the kernel operates on the correct data. Is it possible to make the writing of the data to the buffers happen concurrently with the existing C++ code?
As it stands, before the kernel is launched, the existing C++ code executes another loop, which takes as long as my memory transfers take. This loop does not change or affect the data that I have to write to the kernel before calling it. Is it possible to get the memory transfer to occur concurrently during this period? I would prefer to have the host run the loop while the data is written to the buffers at the same time, saving precious simulation time.
Thanks!
Upvotes: 2
Views: 761
Reputation: 6343
First, a wording correction: you're not writing data to the kernel, you're writing it to the device, and the kernel uses it later.
To do what you want, you need two command queues, one for the data and the other for the kernel execution. Then use events to ensure the kernel doesn't run until the data transfer is done.
On AMD cards, this will get you what you want. On NVIDIA, your source memory will have to be allocated using clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag in order for the clEnqueueWriteBuffer to be asynchronous. See their sample: https://developer.nvidia.com/opencl#oclCopyComputeOverlap
Upvotes: 0
Reputation: 8420
I don't really see a big problem here.
What you simply need is to asynchronously copy the data, while in parallel you perform another operation. That can simply be done with a non-blocking call to clEnqueueWriteBuffer().
Additionally, you can even run the kernel in parallel and keep doing the C++ loop on the CPU. There is no need to wait, since the kernel's data is independent of the data used by the other C++ loop.
//Get the data for OpenCL ready
...
//Copy it asynchronously (CL_FALSE makes the write non-blocking)
//ptr_to_data must stay valid and unmodified until the copy has finished
clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, size, ptr_to_data, 0, NULL, NULL);
clFlush(queue); //Submit the queued command to the device right away (optional)
//Run the kernel (asynchronously as well)
clEnqueueNDRangeKernel(...);
//Do something else
...
//clEnqueueXXXX calls only queue work for the GPU; the non-blocking ones return immediately, so the CPU stays free.
//Get the data back
clEnqueueReadBuffer(...);
//Wait for the GPU to finish
clFinish(queue); //Or make the read blocking instead
Upvotes: 1