user2102611

Reputation:

Do global_work_size and local_work_size have any effect on application logic?

I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear that's partly because a well formed question requires bits of the answer which I don't have.

How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2 then I can

get_global_id(0);
get_global_id(1);

I can divide those two dimensions up into n Work Groups using global_work_size, right? So if I make the global_work_size like so

size_t global_work_size[] = { 4 };

Then each dimension would have 4 work groups for a total of 8? But, as a beginner, I am only using global_id for my indices, so only the global ids matter anyway. As you can tell, I am pretty confused about all of this, so any help you can offer would ...help.


Image I made to try to understand this question

Image describing work groups I found on Google

Upvotes: 11

Views: 8817

Answers (1)

CaptainObvious

Reputation: 2565

Since you stated yourself that you are a bit confused about the concepts involved in the execution space, I'm going to try to summarize them before answering your question, and give some examples along the way.

The threads/workitems are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions. The NDRange is mainly used to map each thread to the piece of data it will have to manipulate. Therefore each thread has to be uniquely identified, and a thread should know which one it is and where it stands in the NDRange. That's where the Work-Item Built-in Functions come in. These functions can be called by all threads to give them information about themselves and about the NDRange they stand in.

The dimensions:

As already stated, an NDRange can have up to 3 dimensions. So if you set the dimensions this way:

size_t global_work_size[2] = { 4, 4 };

It doesn't mean that each dimension would have 4 work groups for a total of 8, but that you'll have 4 * 4, i.e. 16 threads in your NDRange. These threads will be arranged in a "square" with sides of 4 units. The workitems can know how many dimensions the NDRange is made of using the uint get_work_dim() function.
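
If it helps, here is a minimal host-side sketch of how that translates into a call to clEnqueueNDRangeKernel (I'm assuming queue and kernel are an already-created cl_command_queue and cl_kernel, and I'm leaving out error checking):

size_t global_work_size[2] = { 4, 4 };   //4 * 4 = 16 threads in a 2D NDRange

clEnqueueNDRangeKernel(queue, kernel,
                       2,                 //work_dim: number of dimensions of the NDRange
                       NULL,              //global_work_offset
                       global_work_size,  //number of work-items per dimension
                       NULL,              //local_work_size: let the implementation choose
                       0, NULL, NULL);    //no event wait list, no event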

The global size:

Threads can also query how big the NDRange is for a specific dimension with size_t get_global_size (uint D). That way they can know how big the "line/square/rectangle/cube" NDRange is.

The global unique identifiers:

Thanks to that organization, each thread can be uniquely identified with indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to a thread that is in the 3rd column and the 2nd row of a 2D range. The function size_t get_global_id (uint D) is used in the kernel to query the id of the threads.
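
As a small illustration, a very common pattern is to combine get_global_id and get_global_size to turn a thread's 2D position into a flat index into a buffer (the kernel name here is just for the example):

kernel void copy2d(global const int *src, global int *dst){
    size_t x = get_global_id(0);              //column index
    size_t y = get_global_id(1);              //row index
    size_t width = get_global_size(0);        //NDRange size in dimension 0
    dst[y * width + x] = src[y * width + x];  //row-major flat index
}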

The workgroup (or local) size:

The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (and logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. So, to keep your example, since in dimension 0 we have 4 threads, the workgroup size in dimension 0 can be 1, 2, or 4, but not 3. Similarly to the global size, threads can query the local size with size_t get_local_size (uint D).

The local unique identifiers:

Sometimes it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id (uint D). Note the "within" in the previous sentence: a thread with a local id of (1, 0) will be the only one to have this id in its (2D) workgroup, but there will be as many threads with a local id of (1, 0) as there are workgroups in the NDRange.
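
A quick way to see this: in the sketch below (the kernel name and argument are made up, it relies on the OpenCL 1.1 atomic_inc built-in, and the host would have to initialize the counter to 0 before launching), only one thread per workgroup passes the test, yet the counter ends up equal to the number of workgroups because the test passes once in every group:

kernel void count_groups(global int *counter){
    if(get_local_id(0) == 0 && get_local_id(1) == 0){
        //local id (0, 0) exists exactly once per workgroup,
        //so this branch runs once for every workgroup in the NDRange
        atomic_inc(counter);
    }
}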

The number of groups:

Speaking of groups, sometimes a thread might need to know how many groups there are. That's why the function size_t get_num_groups (uint D) exists. Note that, again, you have to pass as a parameter the dimension you are interested in.

Each group has also an id:

...that you can query within a kernel with the function size_t get_group_id (uint D). Note that the format of the group ids is similar to that of the thread ids: tuples of up to 3 elements.
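
To tie all these functions together, here is a little sanity-check sketch (assuming a zero global offset and uniform workgroups, i.e. the pre-2.0 multiple-of rule above holds): rebuilding the global id from the group id, the local size, and the local id should never produce a mismatch:

kernel void check_ids(global int *mismatches){
    for(uint d = 0; d < get_work_dim(); d++){
        size_t rebuilt = get_group_id(d) * get_local_size(d) + get_local_id(d);
        if(rebuilt != get_global_id(d))
            atomic_inc(mismatches);  //never taken with a zero global offset
        if(get_num_groups(d) * get_local_size(d) != get_global_size(d))
            atomic_inc(mismatches);  //never taken with uniform workgroups
    }
}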

Summary:

To wrap things up a bit, if you have a 2D NDRange of a global work size of (4, 6) and a local work size of (2, 2) it means that:

  • the global size in the dimension 0 will be 4
  • the global size in the dimension 1 will be 6
  • the local size (or workgroup size) in the dimension 0 will be 2
  • the local size (or workgroup size) in the dimension 1 will be 2
  • the thread global ids in the dimension 0 will range from 0 to 3
  • the thread global ids in the dimension 1 will range from 0 to 5
  • the thread local ids in the dimension 0 will range from 0 to 1
  • the thread local ids in the dimension 1 will range from 0 to 1
  • The total number of threads in the NDRange will be 4 * 6 = 24
  • The total number of threads in a workgroup will be 2 * 2 = 4
  • The total number of workgroups will be (4/2) * (6/2) = 6
  • the group ids in the dimension 0 will range from 0 to 1
  • the group ids in the dimension 1 will range from 0 to 2
  • there will be only one thread with the global id (0, 0), but there will be 6 threads with the local id (0, 0), because there are 6 groups.

Example:

Here is a dummy example to use all these concepts together (note that performance would be terrible, it's just a stupid example).

Let's say you have a 2D array of 6 rows and 4 columns of int. You want to group these elements into squares of 2 by 2 elements and sum them up in such a way that, for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) will be in one group (hopefully that's clear enough). Since you'll have 6 such "squares", you'll have 6 sums, so you'll need an array of 6 elements to store these results.

To solve this, you use the 2D NDRange detailed just above. Each thread will fetch one element from global memory and store it in local memory. Then, after a synchronization, only one thread per workgroup, let's say the local (0, 0) thread, will sum up the elements (in local memory) and store the result at a specific place in the 6-element array (in global memory).

//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalId = (int2)(get_global_id(0), get_global_id(1));
    int2 localId = (int2)(get_local_id(0), get_local_id(1));
    int2 groupId = (int2)(get_group_id(0), get_group_id(1));
    int2 globalSize = (int2)(get_global_size(0), get_global_size(1));
    int2 localSize = (int2)(get_local_size(0), get_local_size(1));
    int2 numberOfGrp = (int2)(get_num_groups(0), get_num_groups(1));

    //Each thread reads one element from global memory and stores it in local memory
    temp[localId.x + localId.y * localSize.x] = in[globalId.x + globalId.y * globalSize.x];
    //Sync: wait until every element of the workgroup is in local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    //Only the thread with local id (0, 0) sums the elements up
    if(localId.x == 0 && localId.y == 0){
        int sum = 0;
        for(int i = 0; i < localSize.x * localSize.y; i++){
            sum += temp[i];
        }
        //Store the result of this workgroup in global memory
        result[groupId.x + numberOfGrp.x * groupId.y] = sum;
    }
}
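
For completeness, the host side for this example would look roughly like this (queue, kernel, inBuffer and resultBuffer are assumed to be already created, the names are only illustrative, and error checking is omitted). Note the third clSetKernelArg call: for a local argument you pass the size in bytes and a NULL value, so each workgroup gets its own 4-int temp array:

size_t global_work_size[2] = { 4, 6 };   //4 columns * 6 rows = 24 threads
size_t local_work_size[2]  = { 2, 2 };   //2 * 2 = 4 threads per group -> 6 groups

clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuffer);       //in: 24 int buffer
clSetKernelArg(kernel, 1, sizeof(cl_mem), &resultBuffer);   //result: 6 int buffer
clSetKernelArg(kernel, 2, 4 * sizeof(cl_int), NULL);        //temp: 4 ints of local memory per workgroup

clEnqueueNDRangeKernel(queue, kernel,
                       2,                 //work_dim
                       NULL,              //no global offset
                       global_work_size,
                       local_work_size,
                       0, NULL, NULL);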

And finally, to answer your question: do global_work_size and local_work_size have any effect on application logic?

Usually yes, because it's part of the way you design your algorithm. Note that the workgroup size is not chosen randomly, but matches my need here (a 2 by 2 square).

Note also that if you decide to use an NDRange of 1 dimension with a size of 24 and a local size of 4 in that dimension, it'll screw things up too, because the kernel was designed to use 2 dimensions.

Upvotes: 44
