fluctuation

Reputation: 55

What is the best way to handle additional data produced by a small fraction of GPU threads in OpenCL?

I am pretty new to OpenCL and have the following problem:

I have a large array (6 * 1,000,000 floats). For each element of the array I need to do a calculation. The basic algorithm works well on up to 16 GPUs (Tesla K80):

1.) For each GPU device, I create a buffer object for the input array and a buffer object for the results, and write the input array to that device's memory.

2.) Then, a work item (thread) is spawned for each array element and the calculation is performed within a kernel on the GPUs.

3.) The result is written to the result array element corresponding to the global thread id.

4.) The host reads the result buffer.
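In host code, the four steps look roughly like this for a single device (a minimal sketch: platform, context, queue and kernel setup are omitted, error checking is elided, and the sizes assume 1,000,000 float4 elements):

#include <CL/cl.h>

#define N 1000000  /* number of float4 elements, matching the kernel below */

/* Sketch of steps 1-4 on one device; ctx, queue and kernel are assumed
   to be created already, and all error checks are omitted. */
void run_on_device(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                   const cl_float4 *input, cl_float4 *results)
{
    /* 1.) create buffers and write the input array to device memory */
    cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  N * sizeof(cl_float4), NULL, NULL);
    cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, N * sizeof(cl_float4), NULL, NULL);
    clEnqueueWriteBuffer(queue, in_buf, CL_TRUE, 0, N * sizeof(cl_float4), input, 0, NULL, NULL);

    /* 2.) + 3.) spawn one work item per array element; the kernel writes
       its result to result_array[get_global_id(0)] */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    size_t global_size = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    /* 4.) read the result buffer back to the host */
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, N * sizeof(cl_float4), results, 0, NULL, NULL);

    clReleaseMemObject(in_buf);
    clReleaseMemObject(out_buf);
}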

I now have to extend this algorithm. A few of the array elements (10-100) actually require an additional calculation that produces an additional result (another 12 floats).

Here is some pseudo-code.

__kernel void calculation(__global float4 *input_array,
                          __global float4 *result_array){

    int id = get_global_id(0);

    //do the main calculation and store it at the thread's own index
    float4 result = some_func(input_array[id]);
    result_array[id] = result;

    if(some_rare_condition){ //true for only ~10-100 of the 1,000,000 elements
        //do another, much longer calculation
        float4 result2 = another_func(input_array[id]);
        //<-- how and where should result2 be stored?
    }
}

The problem is that there are only a few extra results, and I do not know the best way to store them and let the host read them back.

I do not know which array elements require the additional calculation until I have calculated the first result.

If this were C++, I would just create a vector for the additional results and a vector for their indices. However, as far as I am aware, there are no dynamic memory containers within an OpenCL kernel.

If I create a second result array with 1,000,000 elements and just write to the few positions required, transferring it back to the host becomes a needless bottleneck.

If I create a smaller array that is definitely larger than required (e.g. 1,000 elements), I am not sure how to let threads write to it safely.

Upvotes: 2

Views: 861

Answers (1)

pmdj

Reputation: 23438

The simplest solution is probably to use an atomic counter for assigning indices in the smaller array.

The order in which items will appear in the array will not be predictable, so you will need to store identifying information (e.g. the original id) as well and then perhaps sort it, depending on what you need to do with this output next.
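A minimal sketch of the kernel side, assuming buffers extra_results, extra_ids and extra_count (the names are illustrative) that the host allocates with some headroom and whose counter it zeroes before each launch:

__kernel void calculation(__global float4 *input_array,
                          __global float4 *result_array,
                          __global float4 *extra_results, /* small, e.g. 1000 slots */
                          __global int    *extra_ids,     /* original ids, same size */
                          __global int    *extra_count)   /* zeroed by the host */
{
    int id = get_global_id(0);

    float4 result = some_func(input_array[id]);
    result_array[id] = result;

    if(some_rare_condition){
        float4 result2 = another_func(input_array[id]);

        /* atomically claim the next free slot in the compact output;
           a production kernel should also guard against running past
           the end of the small buffers */
        int slot = atomic_inc(extra_count);
        extra_results[slot] = result2;
        extra_ids[slot]     = id; /* remember which element this belongs to */
    }
}

The host then reads back extra_count and only the first extra_count entries of the two small buffers, so the transfer stays tiny.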

However, whether or not this is efficient depends on a number of factors; to make things worse, these factors work against each other.

First of all, it sounds like the probability of an item needing the second output is well below 0.1%. Such a low rate is good for atomics: the more work items want to increment that counter, the higher the chance that they try to do so at the same time, blocking each other and serialising. Distribution matters too, and any clustering works against you. If the qualifying items are mostly adjacent, they will also be adjacent within a work group, and work items in a work group typically run in lock-step on a GPU, so they must all wait for each other's increments to complete before any of them can proceed.

On the other hand, if another_func() is reasonably computationally expensive (which your question suggests it is), an even distribution is bad too: most work items in a group will sit idle while a single item runs another_func() and occupies the whole compute unit.

Possible variations on the simple approach to counteract the various disadvantages:

  1. Clustering of items taking the branch, or a high probability of taking the branch. In this case, you can allocate ranges of the output array for a whole work group at a time, so that only one work item per group increments the counter. You will need to either reserve a whole work group's worth of output slots, even if some go unused, or run a reduction/scan across the group to work out how many slots are needed and to assign per-item offsets (see the first sketch after this list).
  2. Non-clustered work items with expensive conditional computation. In this case, you may wish to output only the indices of the items that need further computation, using the atomic method, and then call another_func() in a second kernel in which each work item processes one of those indices. That way every work item runs another_func(), so the cores stay fully occupied (see the second sketch after this list).
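For variation 1, a rough sketch using the OpenCL 2.0 work-group built-ins (on OpenCL 1.x, which is what NVIDIA typically exposes, you would hand-roll the scan and reduction in __local memory; the extra_* buffers are the illustrative ones from the sketch above):

__kernel void calculation_grouped(__global float4 *input_array,
                                  __global float4 *result_array,
                                  __global float4 *extra_results,
                                  __global int    *extra_ids,
                                  __global int    *extra_count)
{
    int id = get_global_id(0);

    float4 result = some_func(input_array[id]);
    result_array[id] = result;

    int needs_extra = some_rare_condition ? 1 : 0;

    /* per-item offset within this group's block of output slots,
       and the total number of slots the whole group needs */
    int local_offset = work_group_scan_exclusive_add(needs_extra);
    int group_total  = work_group_reduce_add(needs_extra);

    /* only one work item per group touches the global counter */
    __local int group_base;
    if(get_local_id(0) == 0)
        group_base = (group_total > 0) ? atomic_add(extra_count, group_total) : 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    if(needs_extra){
        int slot = group_base + local_offset;
        extra_results[slot] = another_func(input_array[id]);
        extra_ids[slot]     = id;
    }
}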
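For variation 2, a sketch of the two-pass version: pass 1 only records which ids take the branch, the host reads back extra_count, and then enqueues pass 2 with exactly that many work items (or a rounded-up size plus a bounds check):

__kernel void pass1(__global float4 *input_array,
                    __global float4 *result_array,
                    __global int    *extra_ids,
                    __global int    *extra_count)
{
    int id = get_global_id(0);

    float4 result = some_func(input_array[id]);
    result_array[id] = result;

    /* record only the index; the expensive work happens in pass 2 */
    if(some_rare_condition)
        extra_ids[atomic_inc(extra_count)] = id;
}

__kernel void pass2(__global float4 *input_array,
                    __global int    *extra_ids,
                    __global float4 *extra_results)
{
    /* dense launch: every work item runs another_func() */
    int i = get_global_id(0);
    int id = extra_ids[i];
    extra_results[i] = another_func(input_array[id]);
}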

The two variations can of course be combined if necessary, and there are probably further refinements you can make, especially if you can use your platform's profiler to pinpoint your bottlenecks.

Upvotes: 1
