okeyla

Reputation: 65

headache for clEnqueueNDRangeKernel local work size

For opencl optimization, my idea is try to make match for

1 work-group (kernel code) maps to 1 compute unit (GPU hardware); 1 work-item (kernel code) maps to 1 processing element (GPU hardware). (Maybe my idea is not correct, please teach me.)

For example:

1. I have a global work size of 4000 by 3000.
2. My GPU OpenCL device has a maximum work-group size of 8192.
3. I call clEnqueueNDRangeKernel with the desired local work size (along with all other necessary parameters).
4. I query the limits by function calls:

a. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), (void*)&workGroupSize, NULL);
b. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), (void*)&preferredMultiple, NULL);

(Note that a and b must write into separate variables, otherwise the second call overwrites the result of the first.)

Both a and b above return 8192. The maximum work-group size, CL_KERNEL_WORK_GROUP_SIZE, and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE are all 8192.

I have no idea what rule I should follow to define my local work size...

(Q1) Any good idea for setting the local work size? (10x10? 40x30? X by Y?)

clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_item_size, local_work_item_size, 0, NULL, NULL);

It is a real headache to define the "local_work_item_size" argument of this clEnqueueNDRangeKernel call.

(Q2) Could someone explain the difference between setting local work size = 1,1 and local work size = 4000,3000?

Thank you in advance!

Upvotes: 0

Views: 1262

Answers (1)

Stargazer

Reputation: 106

(Q1) Any good idea for setting the local work size? (10x10? 40x30? X by Y?)

As pmdj pointed out, this depends highly on your application. Since it is unclear how you selected your global_work_size, and since it is linked to the local_work_size, I would like to explain that one first. Usually you want to map the size of the data you are processing to the global_work_size. E.g. if you have an array with 1024 values, you would pick a global_work_size of 1024, because then you can easily use the global ID as an index in your OpenCL kernel:

int index = get_global_id(0);
input_array[index]++;   // your data processing

However, the global_work_size is limited to a maximum of 2^32 - 1. If you have more data to process than that, you can pass your global_work_size and data size as kernel parameters and use a loop like the following one:

int index = get_global_id(0);
for (int i = index; i < data_size; i += global_work_size) {
    input_array[i]++;   // your data processing
}

The last important fact about the global_work_size is that it needs to be divisible by the local_work_size. This can result in your global_work_size being bigger than your data size: e.g. you could have 1000 values while your local_work_size is 32. Then you would round the global_work_size up to 1024 and ensure, through a condition like the one above (i < data_size), that the redundant work-items do not do anything weird, like accessing unallocated memory.

The local_work_size depends on your platform. First of all, you should always use a local_work_size which is a multiple of 32 for NVIDIA or a multiple of 64 for AMD GPUs. This corresponds to the number of work-items that are scheduled together anyway (a warp on NVIDIA, a wavefront on AMD). If you use a different number, the GPU will have idle threads which do nothing but decrease your performance.

Not only the manufacturer but also the specific model of your GPU has to be considered to find the optimal local_work_size. The global_work_size divided by the local_work_size gives the number of work-groups. Each work-group is executed by one thread of your CPU/GPU. If you use OpenCL to run your application on powerful hardware, you want to make sure that it runs as parallel as possible. E.g. if you use an Intel i7 with 8 threads, you would want at least 8 work-groups (global_work_size / local_work_size >= 8). If you use an NVIDIA GeForce GTX 1060 with 1280 CUDA cores, you would want at least 1280 work-groups. But never at the cost of a local_work_size smaller than 32, which is more important!

Having more work-groups than your hardware has threads does not matter; they will simply be processed sequentially. Hence for most applications you can always set your local_work_size to 32/64. The only exception is if you require synchronization among more work-items than that. E.g. barriers only work inside a work-group, not across different work-groups. An example: if you need to sum up chunks of 1024 values before being able to proceed with your algorithm, you would need to set your local_work_size to 1024 for the barrier to work as desired.
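The 1024-value chunk sum could be sketched as an OpenCL kernel like the following (a sketch, not a tuned implementation; it assumes local_work_size = 1024, a __local scratch buffer of the same size set via clSetKernelArg, and the names sum_chunks, chunk_sums, and scratch are my own):

```c
// Each work-group of 1024 work-items sums its own chunk of the input.
// barrier(CLK_LOCAL_MEM_FENCE) synchronizes work-items WITHIN a group only.
__kernel void sum_chunks(__global const int *input,
                         __global int *chunk_sums,
                         __local int *scratch) {
    int lid = get_local_id(0);        // 0..1023 inside the group
    scratch[lid] = input[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);     // wait until the whole chunk is loaded

    // Tree reduction inside the work-group.
    for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        chunk_sums[get_group_id(0)] = scratch[0];
}
```

With a smaller local_work_size, e.g. 32, no single barrier could cover a whole 1024-value chunk, which is exactly why this case forces a larger work-group.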

(Q2) Could someone explain the difference between setting local work size = 1,1 and local work size = 4000,3000?

Both the global_work_size and the local_work_size can have more than one dimension. Whether this is used solely depends on the preference of the programmer. All algorithms can be implemented in one dimension as well, and the number of work-groups is calculated by dividing the product of the global sizes by the product of the local sizes, e.g. if your global_work_size is 20x20 and your local_work_size is 10x10, you would run the program with (20*20) / (10*10) = 4 work-groups.

I personally like to use multiple dimensions when I am processing data which has multiple dimensions. Imagine your input is a two-dimensional image: you could simply use its width and height as the global_work_size (e.g. 1024 x 1024) and choose the local_work_size accordingly (e.g. 32 x 32). In your kernel you could then use the following indices:

int x = get_global_id(0);
int y = get_global_id(1);
input_array[y * get_global_size(0) + x]++;   // your data processing (OpenCL buffers are 1D, so flatten the 2D index)

Upvotes: 2
