elklepo

Reputation: 517

How to avoid constant memory copying in OpenCL

I wrote a C++ application that simulates simple heat flow, using OpenCL for the computation. The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n), and returns a new array with the temperatures after each cycle:

pseudocode:

int t_id = get_global_id(0);
if(t_id < n * n)
{
    m_new[t_id / n][t_id % n] = average of its and its neighbors (top, bottom, left, right) temperatures
}
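For reference, one such cycle can be written out in plain C on the CPU. The boundary rule below (border cells average over only their in-bounds neighbours) is an assumption, since the question does not specify how edges are handled:

```c
/* CPU reference for one simulation cycle: each cell of m_new becomes the
   average of the corresponding cell in m and its in-bounds neighbours
   (top, bottom, left, right). Both arrays are n*n floats in row-major
   order, matching the t_id / n and t_id % n indexing of the kernel. */
static void heat_step(const float *m, float *m_new, int n)
{
    for (int t_id = 0; t_id < n * n; ++t_id) {
        int r = t_id / n, c = t_id % n;
        float sum = m[t_id];
        int count = 1;
        if (r > 0)     { sum += m[t_id - n]; ++count; } /* top    */
        if (r < n - 1) { sum += m[t_id + n]; ++count; } /* bottom */
        if (c > 0)     { sum += m[t_id - 1]; ++count; } /* left   */
        if (c < n - 1) { sum += m[t_id + 1]; ++count; } /* right  */
        m_new[t_id] = sum / count;
    }
}
```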

As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computation cycles, it copies the matrix to the device, runs the kernel, and copies the result back, once for every cycle.

I would like to rewrite the code so that all X cycles run without this constant memory copying to/from the OpenCL device:

  1. Copy memory to the OpenCL device
  2. Call the kernel X times, or call the kernel once and have it compute all X cycles
  3. Copy memory back

I know that each thread in the kernel should wait until all other threads have finished their work, and that after that m[][] and m_new[][] should be swapped. I have no idea how to implement either of those two things.

Or is there another way to do this optimally?

Upvotes: 1

Views: 620

Answers (1)

huseyin tugrul buyukisik

Reputation: 11920

Copy memory to OpenCL device
Call kernel X times
Copy memory back

This works. Make sure the kernel call is non-blocking (this saves 1-2 ms per cycle) and that the buffers are not created with host-accessible properties such as CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
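Only two device buffers are needed for this: the kernel reads one and writes the other, and before each enqueue the host swaps which buffer is bound to which kernel argument (via clSetKernelArg). Below is a plain-C sketch of that ping-pong control flow, with a hypothetical cycle() function standing in for the actual clEnqueueNDRangeKernel call; the important detail is which buffer holds the final result:

```c
/* Stand-in for one kernel launch: writes a new state into dst from src.
   In the real host code this is clSetKernelArg + clEnqueueNDRangeKernel;
   the +1.0f body is a dummy so the swap logic can be verified. */
static void cycle(const float *src, float *dst, int n)
{
    for (int i = 0; i < n * n; ++i)
        dst[i] = src[i] + 1.0f;
}

/* Run X cycles, ping-ponging between buffers a and b.
   Returns the buffer holding the final state: a when X is even,
   b when X is odd. */
static float *run_cycles(float *a, float *b, int n, int X)
{
    float *src = a, *dst = b;
    for (int i = 0; i < X; ++i) {
        cycle(src, dst, n);
        float *tmp = src; src = dst; dst = tmp; /* swap roles for next cycle */
    }
    return src; /* the buffer written by the last cycle */
}
```

In the real host code the two `cl_mem` handles are swapped the same way between enqueues, and only the buffer returned by the loop is read back with a single blocking clEnqueueReadBuffer at the end.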

If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (such as only 256 threads) that loops X times, with a barrier() at the end of each cycle so all 256 threads synchronize before starting the next one. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); if this is a server, it can serve that many computations.
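The single-workgroup idea might look like the following untested OpenCL C sketch. The strided inner loop and the parity-based buffer swap are assumptions, and the neighbour averaging is elided to a comment:

```c
__kernel void simulate(__global float *m, __global float *m_new, int n, int X)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);              /* e.g. 256 */
    for (int cycle = 0; cycle < X; ++cycle) {
        /* pick source/destination by cycle parity instead of swapping */
        __global float *src = (cycle & 1) ? m_new : m;
        __global float *dst = (cycle & 1) ? m : m_new;
        /* each thread processes a strided subset of the n*n cells */
        for (int t_id = lid; t_id < n * n; t_id += lsz) {
            /* dst[t_id] = average of src[t_id] and its neighbours */
            dst[t_id] = src[t_id];
        }
        /* all threads must finish this cycle before any starts the next */
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
```

Note that barrier() only synchronizes within one workgroup, which is exactly why this variant is limited to a single workgroup per problem.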

Global synchronization is not possible because by the time the last threads are launched, the first threads are already gone. The device runs (number of compute units) × (number of threads per workgroup) × (number of wavefronts per workgroup) threads concurrently. For example, an R7 240 GPU with 5 compute units and a local range of 256 can run maybe 5 × 256 × 20 = 25,600 threads at a time.

Then, for further performance, you can apply local-memory optimizations.

Upvotes: 1
