Matt

Reputation: 793

OpenCL work-items per work-group in a 2D NDRange

The code here implements matrix multiplication, written in OpenCL. The three matrices (two input, one output) are each 1024x1024.

In the OpenCL implementation the execution range is two-dimensional: a global range of 1024x1024 work-items, divided into work-groups of 16x16 work-items each.

The question is: why should we set the size of each work-group at all, given that the kernel uses neither local memory nor get_local_id() calls? Wouldn't it be better to set the work-group dimensions to NULL, so that each work-item simply fills one cell of the output matrix?

To me, reading the kernel code (at the bottom of the page I linked), it looks like each work-group is set up to use 16x16 work-items, but in the end they go unused. I would just set the local size to NULL. Why do they use 16x16, and what does it improve? I'm very confused.

Upvotes: 1

Views: 646

Answers (2)

Prashant Ravi

Reputation: 209

The local work size passed to clEnqueueNDRangeKernel just tells the OpenCL device how to group work-items from the global range to exploit parallelism. If you pass NULL, OpenCL selects an appropriate local work size itself and executes your kernel. Ideally, the global work size should be divisible by the local work size, and the local work size should be an integral multiple of the number of compute units on your device. So even if you give NULL as the local work size, I believe OpenCL will still choose an appropriate value and launch the computation in parallel. I have even seen cases where passing NULL made my code run faster than supplying an explicit value.

P.S.: - I'm still a rookie in OpenCL, forgive me for any minor mistakes.

Upvotes: 2

user3813674

Reputation: 2683

The question is: why should we set the size of each work-group at all, given that the kernel uses neither local memory nor get_local_id() calls? Wouldn't it be better to set the work-group dimensions to NULL, so that each work-item simply fills one cell of the output matrix?

True, we are not using local memory or get_local_id(). However, the size of your work-groups and how many work-groups you launch determine the whole computation domain that executes in parallel (in CUDA this is called the grid). This in turn determines get_global_id(), which is what this program uses.

If by setting the work-group dimensions to null you mean 1x1 work-groups, that would create an enormous number of work-groups, and each streaming multiprocessor (SM) can only keep a limited number of work-groups resident at once (each SM on NVIDIA's Tesla and Fermi architectures supports up to 8).

Hope that helps.

Upvotes: 0
