Reputation: 1511
I am trying to learn OpenCL, but there is a source of confusion I do not understand right now. It is related to these lines:
size_t global_item_size = LIST_SIZE; // Process the entire lists
size_t local_item_size = 64; // Divide work items into groups of 64
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size, &local_item_size, 0, NULL, NULL);
I understand that the kernel is launched here across LIST_SIZE work items (I mean, I get LIST_SIZE kernel instances in execution, hopefully in parallel), right? But what does
size_t local_item_size = 64; // Divide work items into groups of 64
mean?
Does this mean that each thread/kernel instance is executed by a 64-channel SIMD-like wavefront? (If so, I would call it double parallelization, but maybe I am confusing something.)
Could someone clarify this or help me understand it? (Also, maybe add some crucial hints about what to watch out for when writing and scheduling such kernels?)
Upvotes: 0
Views: 1858
Reputation: 5087
A work group size of 64 means that the entire workload will be broken into groups of 64 work items. A wavefront is lower-level than the work group and is hardware/driver controlled, i.e., developers cannot change the wavefront size.
A work group can share local memory within itself, but not with other groups. Local memory is normally an order of magnitude faster than the device's global memory, so whenever you need to read a value multiple times, it pays to cache that value in local memory.
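To make that concrete, here is a minimal sketch (hypothetical kernel name; it assumes a 1D launch with a local size of 64, matching your example) of what "sharing local memory within a group" looks like in kernel code:
__kernel void ids_demo(__global const float *in, __global float *out) {
    size_t gid = get_global_id(0); // 0..LIST_SIZE-1: which element is mine
    size_t lid = get_local_id(0);  // 0..63: my position inside my group
    __local float cache[64];       // visible only to the 64 items of this group
    cache[lid] = in[gid];          // each item loads one element
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until the whole group has loaded
    out[gid] = cache[63 - lid];    // any item may now read any cached value
}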
I was going to list a long-winded example of when local memory and work groups help performance, but it strays a bit from the scope of your question. I can still post it if you want me to.
EDIT: I feel like this information isn't very useful without an example. Let's say you need to run a blur filter on a 1920x1080 image, assuming a 2-pixel radius around each pixel (a 5x5 box) and 32-bit pixels (type: float).
Option 1: Have each work item load a square region of pixels, process the target output pixel, and write the result back to global memory. This produces a correct result but is inefficient: every pixel of your source image is read up to 25 times by various work items, and your global writes are scattered as well. (The options below only describe the read-performance boost, not the global writes.) Note that behind the scenes the OpenCL implementation still divides your work into work groups, even though you are not using them in your code. A sketch of this naive kernel follows.
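This is only an illustration (the kernel name and 2D indexing are my assumptions; clamp() handles the image borders):
__kernel void blur5x5_naive(__global const float *src, __global float *dst,
                            int width, int height) {
    int x = get_global_id(0), y = get_global_id(1);
    float sum = 0.0f;
    for (int dy = -2; dy <= 2; dy++)        /* 5x5 box around (x, y) */
        for (int dx = -2; dx <= 2; dx++) {
            int sx = clamp(x + dx, 0, width - 1);  /* clamp at image edges */
            int sy = clamp(y + dy, 0, height - 1);
            sum += src[sy * width + sx];    /* 25 global reads per pixel */
        }
    dst[y * width + x] = sum / 25.0f;
}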
Option 2a: Take advantage of local memory and work groups to bypass the global memory bottleneck. As fast as global memory is on modern devices, local memory is still an order of magnitude faster; it usually runs at or near the clock speed of the device. If you process the same image using a work group size of 64, the work items can cooperatively process an 8x8 region of the output image: they load the required 12x12 region of source pixels into local memory once, and the 25 reads needed per pixel are then served from local memory, speeding up performance greatly. Each group performs 12x12 = 144 global reads to produce 64 output pixels, so the average number of global reads per output pixel is 144/64 = 2.25 (vs 25!). 240x135 work groups can process the entire 1920x1080 image in this way.
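A sketch of this tiled version, under the same assumptions plus a 2D launch with global size (1920, 1080) and local size (8, 8):
#define RADIUS 2
#define TILE   (8 + 2 * RADIUS)   /* 12x12 pixels cached per 8x8 group */

__kernel void blur5x5_tiled(__global const float *src, __global float *dst,
                            int width, int height) {
    __local float tile[TILE][TILE];                   /* 576 bytes */

    int lx = get_local_id(0), ly = get_local_id(1);   /* 0..7 within group */
    int gx = get_global_id(0), gy = get_global_id(1); /* my output pixel */
    int ox = gx - lx - RADIUS, oy = gy - ly - RADIUS; /* tile origin in src */

    /* The 64 work items cooperatively fill all 144 tile slots; the
       stride-8 loops ensure each slot is written exactly once. */
    for (int ty = ly; ty < TILE; ty += 8)
        for (int tx = lx; tx < TILE; tx += 8) {
            int sx = clamp(ox + tx, 0, width - 1);    /* clamp at edges */
            int sy = clamp(oy + ty, 0, height - 1);
            tile[ty][tx] = src[sy * width + sx];
        }
    barrier(CLK_LOCAL_MEM_FENCE);                     /* tile fully loaded */

    /* The 25 reads for this pixel now hit local memory, not global. */
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; dy++)
        for (int dx = -RADIUS; dx <= RADIUS; dx++)
            sum += tile[ly + RADIUS + dy][lx + RADIUS + dx];
    dst[gy * width + gx] = sum / 25.0f;
}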
Option 2b: Still using a work group size of 64, your group can process a larger region of the image. Option 2a above uses a work group to produce an 8x8 region of output, but it only utilizes 12*12*sizeof(float) = 576 bytes of local memory. By enlarging the region a work group caches, you reduce the share of border pixels that are read twice or four times by neighboring groups. The OpenCL spec says that devices need to have 32 KB of local memory available. A bit of math determines the largest square float tile that fits: floor(sqrt(32768/sizeof(float))) = 90, so each work group can cache a 90x90 tile of source pixels and emit an 86x86 region of output. Average global reads per output pixel drops to (90/86)^2 ≈ 1.10, and only 23x13 work groups are needed to cover the image.
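If it helps, here is that sizing math as a host-side C sketch (my own illustration; the helper name print_tile_size is hypothetical, and it queries the device's actual local memory size rather than assuming the 32 KB spec minimum):
#include <math.h>
#include <stdio.h>
#include <CL/cl.h>

void print_tile_size(cl_device_id device) {
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    /* largest square float tile that fits: floor(sqrt(bytes / 4)) */
    int tile = (int)floor(sqrt((double)(local_mem / sizeof(float))));
    int out  = tile - 2 * 2;  /* subtract the 2-pixel halo on each side */

    printf("local mem: %llu bytes -> cache %dx%d pixels, emit %dx%d output\n",
           (unsigned long long)local_mem, tile, tile, out, out);
}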
When you are able to achieve a nearly 23:1 reduction in the number of global reads, even a low-end GPU can suddenly become ALU performance bound.
Upvotes: 2