Andrew Stephens

Reputation: 10193

Summing a stream of records using OpenCL

We acquire data from an external device in the form of a continuous stream of "records" of 16-bit values. The record length is configured before the streaming starts and can vary between 20,000 and 50,000 (values, not bytes). This remains the same for the duration of the acquisition, i.e. all records in the stream will be the same size.

The acquired records get written into a GPU buffer, and once the buffer contains a pre-determined number of records (say 20,000) then they are processed by a series of kernels. The first of these sums every N neighbouring records to create one new record (of the same length) containing the summed values. N is the "number of records to accumulate" and can vary between 2 and 100.

Once those 20,000 records have been processed by the kernels, the whole process repeats with another 20,000 records being acquired, written to the GPU, processed, and so on.

This is the existing summing (accumulation) kernel:

__kernel void firstPassAccumulation(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength, 
    const int numAccums)
{
    int blockNumber = get_global_id(0); // Index of the "block" of 'numAccums' records to accumulate
    int i = get_global_id(1); // Index of the value to sum within the record

    int blockStart = (blockNumber * numAccums * recordLength);

    float sum = 0;
    for (int rec = 0; rec < numAccums; rec++)
    {       
        sum += inputBuffer[blockStart + (rec * recordLength) + i];
    }

    outputBuffer[blockNumber * recordLength + i] = sum;
}

The arguments are:

- inputBuffer: the buffer of acquired 16-bit records
- outputBuffer: the buffer that receives the summed (accumulated) records
- recordLength: the number of values in each record
- numAccums: the number of neighbouring records to sum together

Global work size is [x,y] where:

- x = the number of records in the input buffer divided by numAccums (i.e. the number of "blocks" to accumulate)
- y = the record length
As an example, say the number of accumulations = 4, record length = 30000, and the "input" buffer contains 20,000 records. The global work size will therefore be: [5000,30000].

Each kernel instance (work item) is responsible for summing a single value from across the 4 records in its assigned "block", then placing that summed value in the relevant position in the output buffer, which will contain 5,000 summed records once the kernel has finished executing.

Local work size is NULL by the way, and we use Radeon Pro WX7100 cards if that is relevant.
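For context, here is a rough sketch of how we launch the kernel for the example above (4 accumulations, 30,000-value records, 20,000 records in the buffer). The function and variable names are placeholders rather than our actual code:

#include <CL/cl.h>

/* Hypothetical helper showing how the kernel above could be enqueued for the
 * example in the question. The queue, kernel and buffers are assumed to have
 * been created earlier; none of these names come from our real code. */
void enqueueFirstPassAccumulation(cl_command_queue queue, cl_kernel kernel,
                                  cl_mem inputBuf, cl_mem outputBuf)
{
    cl_int recordLength = 30000;
    cl_int numAccums    = 4;
    size_t numRecords   = 20000;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_int), &recordLength);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &numAccums);

    /* Global size: [number of blocks, record length] = [5000, 30000]. */
    size_t globalSize[2] = { numRecords / numAccums, (size_t)recordLength };

    /* Local work size is NULL, so the runtime chooses the work-group shape. */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, NULL,
                           0, NULL, NULL);
}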

This kernel has been working fine and I wouldn't say performance is a cause for concern, so I'm just curious to know whether it could be improved. This was our first (and only) OpenCL app, so I'm sure there are plenty of things we haven't done well. I've always struggled to understand the concept of memory coalescing, but I'm assuming this kernel won't be very efficient because of the way the "for" loop has to jump from one record to the next as it sums the value at the same position in each one.
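For illustration only: one pattern I have seen recommended is to make get_global_id(0) index the value within the record, so that consecutive work items read consecutive shorts. A minimal sketch of that variant is below; the kernel name is made up, it assumes the host swaps the global work size to [recordLength, numberOfBlocks], and I haven't measured whether it actually helps:

__kernel void firstPassAccumulationSwapped(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength,
    const int numAccums)
{
    int i = get_global_id(0);           // Index of the value within the record (now dimension 0)
    int blockNumber = get_global_id(1); // Index of the "block" of numAccums records

    int blockStart = blockNumber * numAccums * recordLength;

    float sum = 0.0f;
    for (int rec = 0; rec < numAccums; rec++)
    {
        // Consecutive work items in dimension 0 read consecutive shorts here
        sum += inputBuffer[blockStart + (rec * recordLength) + i];
    }

    outputBuffer[blockNumber * recordLength + i] = sum;
}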

Upvotes: 0

Views: 45

Answers (1)

Simon Goater

Reputation: 1898

There's a strong argument for keeping things simple in OpenCL programming. If your code works, and the performance isn't an issue, I suggest you don't try to fix it.

Having said that, I think there are opportunities for performance improvements here, though not with trivial changes. Each element of your output requires numAccums additions. If I have understood your application correctly, you could reduce this to two in the general case, no matter the value of numAccums, by keeping a running total that adds the element from the next record and subtracts the element from the record that has passed.

In practice, if you change it to work as I suggested, I would recommend using long integer arithmetic for the sum, as it is exact, and then casting the result to float if that's what you want your end result to be. My suggestion also doesn't require all the intermediate records to be available in cache memory, so there could be performance benefits from that too.
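As a rough illustration of the running-total and long-arithmetic points (one possible reading, assuming the blocks of numAccums records do not overlap): one work item per value position streams through the records once, keeping an exact running total in a long and writing out a summed record every numAccums records. The kernel name and the extra numRecords argument are illustrative only:

__kernel void runningTotalAccumulation(
    __global const short* inputBuffer,
    __global float* outputBuffer,
    const int recordLength,
    const int numAccums,
    const int numRecords)
{
    int i = get_global_id(0); // Value position within a record

    long sum = 0;  // Exact integer running total, cast to float only when written out
    int block = 0; // Index of the output (summed) record being built

    for (int rec = 0; rec < numRecords; rec++)
    {
        sum += inputBuffer[rec * recordLength + i];

        if ((rec + 1) % numAccums == 0)
        {
            outputBuffer[block * recordLength + i] = (float)sum;
            sum = 0;
            block++;
        }
    }
}

This would be launched with a one-dimensional global size equal to recordLength, so consecutive work items also happen to read consecutive shorts from each record.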

Upvotes: 0
