Andrew Stephens

Reputation: 10193

Summing a stream of records using OpenCL

We acquire data from an external device in the form of a continuous stream of "records" of 16-bit values. The record length is configured before the streaming starts and can vary between 20,000 and 50,000 (values, not bytes). This remains the same for the duration of the acquisition, i.e. all records in the stream will be the same size.

The acquired records get written into a GPU buffer, and once the buffer contains a pre-determined number of records (say 20,000) then they are processed by a series of kernels. The first of these sums every N neighbouring records to create one new record (of the same length) containing the summed values. N is the "number of records to accumulate" and can vary between 2 and 100.

Once those 20,000 records have been processed by the kernels, the whole process repeats with another 20,000 records being acquired, written to the GPU, processed, and so on.

This is the existing summing (accumulation) kernel:

__kernel void firstPassAccumulation(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength, 
    const int numAccums)
{
    int blockNumber = get_global_id(0); // Index of the "block" of 'numAccums' records to accumulate
    int i = get_global_id(1); // Index of the value to sum within the record

    int blockStart = (blockNumber * numAccums * recordLength);

    float sum = 0;
    for (int rec = 0; rec < numAccums; rec++)
    {       
        sum += inputBuffer[blockStart + (rec * recordLength) + i];
    }

    outputBuffer[blockNumber * recordLength + i] = sum;
}

The arguments are:

- inputBuffer: the buffer of acquired 16-bit records
- outputBuffer: the buffer that receives the summed (accumulated) records
- recordLength: the number of values in each record
- numAccums: the number of neighbouring records to sum together

Global work size is [x,y] where:

- x = the number of records in the input buffer divided by numAccums (i.e. the number of "blocks" to accumulate)
- y = the record length
As an example, say the number of accumulations = 4, record length = 30000, and the "input" buffer contains 20,000 records. The global work size will therefore be: [5000,30000].

Each kernel instance (work item) is responsible for summing a single value from across the 4 records in its assigned "block", then placing that summed value in the relevant position in the output buffer, which will contain 5,000 summed records once the kernel has finished executing.

Local work size is NULL by the way, and we use Radeon Pro WX7100 cards if that is relevant.
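For context, here is a rough sketch of how we launch the kernel for the example above (4 accumulations, 30,000-value records, 20,000 records in the buffer). The function and variable names are placeholders rather than our actual code:

#include <CL/cl.h>

/* Hypothetical helper showing how the kernel above could be enqueued for the
 * example in the question. The queue, kernel and buffers are assumed to have
 * been created earlier; none of these names come from our real code. */
void enqueueFirstPassAccumulation(cl_command_queue queue, cl_kernel kernel,
                                  cl_mem inputBuf, cl_mem outputBuf)
{
    cl_int recordLength = 30000;
    cl_int numAccums    = 4;
    size_t numRecords   = 20000;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_int), &recordLength);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &numAccums);

    /* Global size: [number of blocks, record length] = [5000, 30000]. */
    size_t globalSize[2] = { numRecords / numAccums, (size_t)recordLength };

    /* Local work size is NULL, so the runtime chooses the work-group shape. */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL,
                           0, NULL, NULL);
}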

This kernel has been working fine and I wouldn't say performance is a cause for concern, so I'm just curious to know whether it could be improved. This was our first (and only) OpenCL app, so I'm sure there are plenty of things we haven't done well. I've always struggled to understand the concept of memory coalescing, but I'm assuming this kernel won't be very efficient because of the way the "for" loop has to jump from one record to the next as it sums the value at the same position in each one.
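For illustration only: one pattern I have seen recommended is to make get_global_id(0) index the value within the record, so that consecutive work items read consecutive shorts. A minimal sketch of that variant is below; the kernel name is made up, it assumes the host swaps the global work size to [recordLength, numberOfBlocks], and I haven't measured whether it actually helps:

__kernel void firstPassAccumulationSwapped(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength,
    const int numAccums)
{
    int i = get_global_id(0);           // Index of the value within the record (now dimension 0)
    int blockNumber = get_global_id(1); // Index of the "block" of numAccums records

    int blockStart = blockNumber * numAccums * recordLength;

    float sum = 0.0f;
    for (int rec = 0; rec < numAccums; rec++)
    {
        // Consecutive work items in dimension 0 read consecutive shorts here
        sum += inputBuffer[blockStart + (rec * recordLength) + i];
    }

    outputBuffer[blockNumber * recordLength + i] = sum;
}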

Upvotes: 0

Views: 45

Answers (1)

Simon Goater

Reputation: 1898

There's a strong argument for keeping things simple in OpenCL programming. If your code works, and the performance isn't an issue, I suggest you don't try to fix it.

Having said that, I think there are opportunities for performance improvements here, though not with trivial changes. Each element of your output requires numAccums additions. If I have understood your application correctly, you could reduce this to two in the general case, no matter the value of numAccums, by keeping a running total that adds the element from the next record and subtracts the element from the record that has passed.

In practice, if you change it to work as I suggested, I would recommend using long integer arithmetic for the sum, as it is exact, and then casting the result to float if that's what you want your end result to be. My suggestion also doesn't require all the intermediate records to be available in cache memory, so there could be performance benefits from that too.
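As a rough illustration of the running-total and long-arithmetic points (one possible reading, assuming the blocks of numAccums records do not overlap): one work item per value position streams through the records once, keeping an exact running total in a long and writing out a summed record every numAccums records. The kernel name and the extra numRecords argument are illustrative only:

__kernel void runningTotalAccumulation(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength,
    const int numAccums,
    const int numRecords)
{
    int i = get_global_id(0); // Value position within a record

    long sum = 0;  // Exact integer running total, cast to float only when written out
    int block = 0; // Index of the output (summed) record being built

    for (int rec = 0; rec < numRecords; rec++)
    {
        sum += inputBuffer[rec * recordLength + i];

        if ((rec + 1) % numAccums == 0)
        {
            outputBuffer[block * recordLength + i] = (float)sum;
            sum = 0;
            block++;
        }
    }
}

This would be launched with a one-dimensional global size equal to recordLength, so consecutive work items also happen to read consecutive shorts from each record.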

Upvotes: 0
