Reputation: 680
I'm doing some image processing using OpenCL.
For example, I used a 100*200 image. In the .cl code, I just halve each pixel value:
{
// image dimensions come from the global work size, not the group id
int width=get_global_size(0);
int height=get_global_size(1);
// col(width)
int x= get_global_id(0);
// row(height)
int y= get_global_id(1);
// row-major index: row*width+col
data_output[y*width+x]=
(unsigned char)(data_input[y*width+x]/2);
}
After setting the kernel's arguments, I run the kernel with:
clEnqueueNDRangeKernel( queue,kernel_DIP,2,NULL,global_work_size,local_work_size, 0,NULL,NULL);
The global_work_size I used is the image size:
size_t global_work_size[2] = {100,200};
I found that even though the .cl code doesn't call anything like "get_local_id(0);", the local_work_size still has a big influence on performance.
Both "size_t local_work_size[2]= {1,1};" (small local work size) and "size_t local_work_size[2]= {50,50};" (big local work size) are slow.
A suitable size like the one below is much faster:
size_t local_work_size[2]= {10,10};
So here is my question:
Why is code that doesn't call get_local_id() also influenced by the local work size?
How can I choose the best local size to make it run at the highest speed?
I also tested the running speed on other platforms such as Freescale's i.MX6; there, changing the local work size seems to have no effect at all. Why?
If anyone knows the answer, please help. Thank you so much!
Upvotes: 2
Views: 2846
Reputation: 54709
DarkZeros already mentioned that you can set the local work size to NULL to let OpenCL choose the size that it considers "appropriate", given the global work size and the device that it is executed on.
However, for some global work sizes, OpenCL may not be able to choose a "suitable" local work size, particularly when the global work size is a prime number that is larger than the maximum local work size. Then it might be forced to use a local work size of 1. You may consider padding your input data so that it can be distributed nicely among several work-groups. (I recently wrote a few words about this in https://stackoverflow.com/a/22969485 )
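To make the prime-size case concrete, here is a minimal sketch in plain C (no OpenCL calls). The helper names `largest_dividing_local_size` and `pad_global_size` are made up for illustration, and the search for the largest divisor is only an assumption about what a runtime could do; an actual OpenCL implementation is free to use a different heuristic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest local size <= max_local that evenly divides
   global_size. Only illustrates why a prime global size is a problem. */
static size_t largest_dividing_local_size(size_t global_size, size_t max_local)
{
    size_t local = (global_size < max_local) ? global_size : max_local;
    /* Walk down until a divisor is found; 1 always divides. */
    while (local > 1 && global_size % local != 0) {
        --local;
    }
    return local;
}

/* Hypothetical helper: round global_size up to the next multiple of
   local_size, i.e. the padded global work size. */
static size_t pad_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return ((global_size + local_size - 1) / local_size) * local_size;
}
```

With an assumed device limit of 256, `largest_dividing_local_size(1009, 256)` can only return 1 because 1009 is prime, while `pad_global_size(1009, 64)` gives 1024, which divides evenly into work-groups of 64. If the global size is padded like this, the kernel needs a bounds check so that the extra work-items do not read or write outside the image.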
For complex kernels, you may consider querying the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and basing your computation on that, but for this simple kernel, this should not be necessary.
Additionally, you might want to have a look at the "AMD APP KernelAnalyzer" or the "NVIDIA Occupancy Calculator". These tools may give some hints about an appropriate configuration for certain target platforms (although, preferably, the code should be written as generically as possible, as long as that does not have too severe a performance impact).
Upvotes: 3
Reputation: 8420
get_local_id() does not affect the performance at all. It is only a tool to get the work-item ID within its work-group inside the kernel, which allows some synchronized tasks inside the group.
Upvotes: 2