Reputation: 517
I wrote C++ application which is simulating simple heat flow. It is using OpenCL for computing. OpenCL kernel is taking two-dimensional (n x n) array of temperatures values and its size (n). It returns new array with temperatures after each cycle:
pseudocode:
int t_id = get_global_id(0);
if(t_id < n * n)
{
m_new[t_id / n][t_id % n] = average of its and its neighbors (top, bottom, left, right) temperatures
}
As You can see, every thread is computing single cell in matrix. When host application needs to perform X computing cycles it looks like this
I would like to rewrite kernel code to perform all X cycles without constant memory copying to/from OpenCL device.
I know that each thread in kernel should lock when all other threads are doing their job and after that - m[][] and m_new[][] should be swapped. I have no idea how to implement any of those two functionalities.
Or maybe there is another way to do this optimally?
Upvotes: 1
Views: 620
Reputation: 11920
Copy memory to OpenCL device
Call kernel X times
Copy memory back
this works. Make sure call kernel
is not blocking(so 1-2 ms per cycle is saved) and there aren't any host-accesible buffer properties such as USE_HOST_PTR or ALLOC_HOST_PTR.
If calling kernel X times doesn't get satisfactory performance, you can try using single workgroup(such as only 256 threads) with looping X times that each cycles has a barrier() at the end so all 256 threads synchronize before starting next cycle. This way you can compute M different heat-flow problems at the same time where M is number of compute units(or workgroups) if that is a server, it can serve that many computations.
Global synchronization is not possible because when latest threads are launched, first threads are already gone. It works with (number of compute units)(number of threads per workgroup)(number of wavefronts per workgroup) threads concurrently. For example, a R7-240 gpu with 5 compute units and local-range=256, it can run maybe 5*256*20=25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.
Upvotes: 1