David Ding

Reputation: 680

How do I set OpenCL's local work size?

I'm doing some image processing using OpenCL.

For example, I use a 100*200 image. In the .cl code, I simply halve each pixel value:

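{
  // image dimensions come from the global work size, not the group id
  int width = get_global_size(0);
  int height = get_global_size(1);

  // column index (along the width)
  int x = get_global_id(0);
  // row index (along the height)
  int y = get_global_id(1);
  // row-major index: row * width + column
  data_output[y * width + x] =
     (unsigned char)(data_input[y * width + x] / 2);
}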

After setting the kernel's arguments, I run the kernel with:

clEnqueueNDRangeKernel(queue, kernel_DIP, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The global_work_size I used is the image size:

   size_t global_work_size[2] = {100,200};

I found that even though the .cl code never calls anything like "get_local_id(0);", the local_work_size still has a large influence on performance.

Both "size_t local_work_size[2]= {1,1};" (small local work size) and "size_t local_work_size[2]= {50,50};" (big local work size) are slow.

A size in between, like the one below, is much faster:

size_t local_work_size[2]= {10,10};

So here are my questions:

  1. Why is code that never calls get_local_id() still affected by the local work size?

  2. How can I choose the best local size so that the kernel runs at the highest speed?

  3. I also tested the running speed on other platforms such as Freescale's i.MX 6, and changing the local work size seems to make no difference there at all. Why?

If anyone knows the answer, please help. Thank you so much!

Upvotes: 2

Views: 2846

Answers (2)

Marco13

Reputation: 54709

DarkZeros already mentioned that you can set the local work size to NULL to let OpenCL choose a size that it considers appropriate, given the global work size and the device that it is executed on.

However, for some global work sizes, OpenCL may not be able to choose a suitable local work size, particularly when the global work size is a prime number that is larger than the maximum local work size. It might then be forced to use a local work size of 1. You may consider padding your input data so that it can be distributed evenly among several work-groups. (I recently wrote a few words about this in https://stackoverflow.com/a/22969485 )
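The padding idea can be sketched in host-side C roughly as follows. This is only a minimal sketch, not the code from the linked answer: the helper name round_up_to_multiple is hypothetical, and it assumes the kernel receives the real image dimensions as extra arguments so that work-items in the padded region can return early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `local` that is >= `global`.
 * The result can be used as one dimension of global_work_size so that
 * `local` divides it evenly. */
size_t round_up_to_multiple(size_t global, size_t local)
{
    assert(local > 0); /* a work-group dimension can never be 0 */
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}
```

For the 100*200 image with a local size of {16,16}, this would give a padded global size of {112,208}. The kernel then needs a guard such as "if (x >= width || y >= height) return;", with the true width and height passed in as arguments, so that the extra work-items do not touch memory outside the image.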

For complex kernels, you may consider querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and basing your computation on that, but for this simple kernel, this should not be necessary.
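One possible way to base the computation on that value, for a single dimension, is sketched below. The function pick_local_size and its search strategy are assumptions made for illustration, not an official recipe; the preferred_multiple argument is meant to be the value obtained from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, and max_group_size the device's CL_DEVICE_MAX_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic: largest work-group size that is a multiple of
 * `preferred_multiple`, does not exceed `max_group_size`, and divides
 * `global` evenly. Returns 0 if no such size exists; the caller could
 * then pad the global size or pass NULL as the local work size. */
size_t pick_local_size(size_t global, size_t preferred_multiple,
                       size_t max_group_size)
{
    assert(preferred_multiple > 0);
    /* Start at the largest multiple that still fits the device limit. */
    size_t candidate =
        (max_group_size / preferred_multiple) * preferred_multiple;
    for (; candidate > 0; candidate -= preferred_multiple) {
        if (global % candidate == 0)
            return candidate;
    }
    return 0;
}
```

With a global size of 20000, a preferred multiple of 32 and a maximum of 256, this picks 160. For a global size of 100 it finds nothing and returns 0, which is the case where padding or NULL is the simpler choice.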

Additionally, you might want to have a look at the "AMD APP KernelAnalyzer" or "NVIDIA Occupancy Calculator". These tools may give some hints about an appropriate configuration for certain target platforms (although, preferably, the code should be written as generically as possible, as long as that does not have too severe a performance impact).

Upvotes: 3

DarkZeros

Reputation: 8420

  1. Local size affects how the work is performed on the device. Whether or not the code calls get_local_id() does not affect performance at all; that function is only a tool to get the work-item's ID within its group, which allows synchronized tasks inside the group.
  2. If your code does not need a specific size (and it does not), simply set it to the default, that is, NULL.
  3. Changing the local size deliberately by "trial and error" is not the way to go, and is likely not to work at all in some cases. The local size has to follow some rules:
    • The total local size (multiplying all the dimensions) cannot be higher than the device maximum local size (CL_DEVICE_MAX_WORK_GROUP_SIZE).
    • The dimensions cannot be higher than the dimension limits specified in CL_DEVICE_MAX_WORK_ITEM_SIZES.
    • The local work group size has to be an integer divisor of the global size (in all the dimensions).
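The three rules above can be checked on the host before enqueueing the kernel. The following is a minimal sketch for the 2D case: the function name local_size_is_valid is hypothetical, and the limits are assumed to have been read beforehand with clGetDeviceInfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: checks a 2D local work size against the three rules.
 * max_group_size - value reported for CL_DEVICE_MAX_WORK_GROUP_SIZE
 * max_item_sizes - first two values reported for
 *                  CL_DEVICE_MAX_WORK_ITEM_SIZES */
bool local_size_is_valid(const size_t global[2], const size_t local[2],
                         size_t max_group_size,
                         const size_t max_item_sizes[2])
{
    assert(global && local && max_item_sizes);
    for (int d = 0; d < 2; ++d) {
        if (local[d] == 0) return false;
        if (local[d] > max_item_sizes[d]) return false; /* rule 2 */
        if (global[d] % local[d] != 0) return false;    /* rule 3 */
    }
    return local[0] * local[1] <= max_group_size;       /* rule 1 */
}
```

On a device that reports a maximum work-group size of 256, for example, {10,10} passes for a {100,200} global size, while {50,50} fails rule 1 (2500 work-items) and {16,16} fails rule 3.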

Upvotes: 2
