Reputation: 3993
The OpenCL standard defines the following queries for getting information about a device and a compiled kernel:
CL_DEVICE_MAX_COMPUTE_UNITS
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_KERNEL_WORK_GROUP_SIZE
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
Given these values, how can I calculate the optimal work group size and the number of work groups?
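For reference, these are obtained with clGetDeviceInfo and clGetKernelWorkGroupInfo. A minimal sketch, assuming an already-created device and a built kernel, with error checking omitted:

    #include <CL/cl.h>
    #include <stdio.h>

    /* Query the four values listed above for a given device and kernel. */
    void print_sizes(cl_device_id device, cl_kernel kernel)
    {
        cl_uint compute_units;
        size_t dev_max_wg, kernel_wg, preferred_multiple;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(dev_max_wg), &dev_max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_wg), &kernel_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple), &preferred_multiple, NULL);

        printf("compute units:          %u\n",  compute_units);
        printf("device max WG size:     %zu\n", dev_max_wg);
        printf("kernel max WG size:     %zu\n", kernel_wg);
        printf("preferred WG multiple:  %zu\n", preferred_multiple);
    }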
Upvotes: 12
Views: 6678
Reputation: 1
As mfa said, you have to discover these experimentally. I wanted to add that, depending on what you are computing (in particular how small or large each work item's job is), a good starting point is to check the base cases and see how each one affects the processing pipeline.
In essence, you have to tweak it. I often execute the kernel several times with different parameters (profiling each run) and then generate a surface plot to see how it behaves.
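A minimal sketch of such a sweep for a 1D kernel, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE, the kernel arguments are already set, and the global size divides evenly by each candidate local size (for a surface plot you would sweep two parameters instead of one):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Time the same kernel launch for several candidate local sizes
     * using OpenCL profiling events. */
    void sweep_local_sizes(cl_command_queue queue, cl_kernel kernel, size_t global_size)
    {
        const size_t candidates[] = {16, 32, 64, 128, 256};
        for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); ++i) {
            size_t local_size = candidates[i];
            cl_event evt;

            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global_size, &local_size, 0, NULL, &evt);
            clWaitForEvents(1, &evt);

            cl_ulong start = 0, end = 0;
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                    sizeof(start), &start, NULL);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                    sizeof(end), &end, NULL);
            clReleaseEvent(evt);

            /* profiling timestamps are in nanoseconds */
            printf("local size %4zu: %.3f ms\n", local_size, (end - start) * 1e-6);
        }
    }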
Upvotes: 0
Reputation: 5087
You discover these values experimentally for your algorithm. Use a profiler to get hard numbers.
I like to use CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups, because I often rely on synchronizing work items. I usually run kernels with little branching, so they take the same time to execute in each compute unit.
Some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be optimal for your device. What that multiple actually is depends on your memory access pattern and the type of work each work item is doing. Use 1 as the multiple when you are running a heavy, compute-bound (ALU) kernel. Try a larger multiple to hide memory latency if you are bottlenecked by memory access. Use a profiler to determine when your access time and your ALU time are optimal.
The optimal ALU-to-fetch ratio is 1:1 for any device. This is rarely achieved in practice, so you want to keep the ALU/SIMD banks saturated; that means ALU:fetch should be greater than 1 whenever possible. A ratio less than 1 means you should try a larger work group size to better hide the memory latency.
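To make that concrete, here is a sketch (my own helper, not part of the OpenCL API) that combines the queried values for a 1D kernel: the local size is some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, capped at CL_KERNEL_WORK_GROUP_SIZE, and the global size launches one work group per compute unit. The multiple parameter is the knob you tune with the profiler, as described above:

    #include <CL/cl.h>

    /* Derive a 1D launch configuration from the queried limits.
     * "multiple" is the tuning knob: 1 for ALU-bound kernels, larger
     * values to hide memory latency. Error checking omitted. */
    void pick_sizes(cl_device_id device, cl_kernel kernel, size_t multiple,
                    size_t *local_size_out, size_t *global_size_out)
    {
        cl_uint compute_units;
        size_t kernel_wg, preferred;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_wg), &kernel_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred), &preferred, NULL);

        /* local size: a multiple of the preferred multiple, capped by the
         * kernel's maximum work-group size */
        size_t local_size = preferred * multiple;
        if (local_size > kernel_wg)
            local_size = (kernel_wg / preferred) * preferred;

        /* one work group per compute unit */
        *local_size_out  = local_size;
        *global_size_out = local_size * compute_units;
    }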
Upvotes: 9