Reputation: 817
I work in a kernel with a large array of unsigned characters. I create a memory object with clCreateBuffer, then copy a chunk of unsigned chars into it with clEnqueueWriteBuffer. Then, in a loop, I call a kernel that reads from this memory object, does some computation, and writes new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside this loop). Here is the kernel code:
__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;

    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[offset + i];

    // Make some computation here, not complicated logic

    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[offset + i] = msg[i];
}
When the loop is done (it runs about 1000 times), I read the result from the memory object with clEnqueueReadBuffer.
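For reference, the host-side sequence described above looks roughly like this (a sketch with illustrative variable names; error checking and setup omitted):

```c
// Sketch of the host-side flow (names such as ctx, queue, kernel,
// host_data, size, global_size, local_size are illustrative).
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &offset);

for (int iter = 0; iter < 1000; iter++) {
    // No read/write between iterations; the data stays on the device.
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);
}

clFinish(queue);
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, NULL);
```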
Is it possible to optimize this code?
Upvotes: 2
Views: 3413
Reputation: 9886
Some suggestions: compute the per-group offset once by adding

in += get_group_id(0) * offset;

at the beginning of the kernel.
Upvotes: 2
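Applied to the kernel in the question, that adjustment folds the per-group offset into the base pointer once, instead of recomputing it in every array index (a sketch):

```c
__kernel void test(__global unsigned char *in, unsigned int offset) {
    // Advance the base pointer once per work-item.
    in += get_group_id(0) * offset;

    unsigned char msg[1024];
    size_t i;

    for (i = 0; i < 1024; i++)
        msg[i] = in[i];

    // ... computation ...

    for (i = 0; i < 1024; i++)
        in[i] = msg[i];
}
```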
Reputation: 2401
To give concrete optimization advice, one would need to know what kind of calculations you do. The biggest performance gains usually come from grouping your calculations into work-groups and letting them operate on local memory. You need to pay close attention to the sizes of your private memory (smallest) and local memory (small).
How often is your kernel called? Do all kernels use the same data? One could use a local memory buffer where all work-items in a work-group each read a part of the data into local memory and then share it. You need to pay a little attention to synchronization.
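A cooperative load into local memory might look like the following sketch. It assumes the host passes a __local buffer of at least 1024 bytes via clSetKernelArg; the barriers are the synchronization mentioned above, ensuring every work-item finishes loading before any of them reads the shared data:

```c
__kernel void test_local(__global unsigned char *in,
                         unsigned int offset,
                         __local unsigned char *tile) {
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * offset;

    // Each work-item loads a strided slice of the group's 1024-byte chunk.
    for (size_t i = lid; i < 1024; i += lsz)
        tile[i] = in[base + i];

    // All loads must complete before anyone reads tile[].
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... computation on tile[], shared by the whole work-group ...

    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t i = lid; i < 1024; i += lsz)
        in[base + i] = tile[i];
}
```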
I suggest having a look at the samples from your SDK vendor. I only know the nVidia SDK; the samples there are quite complex, but very interesting to read.
The change to vector types like float4 should suit ATI boards. nVidia is said to work best with scalars and internal compiler optimization. This is something for fine-tuning later with a profiler. You can gain orders of magnitude in performance through memory optimization.
Upvotes: 0
Reputation: 915
Something that comes to mind first: unrolling your loop can help you skip the condition evaluation on every iteration. An unroll pragma can make this easier.
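On compilers that honor it (it is a vendor hint, not part of the core OpenCL 1.x specification), the copy loop from the question could be annotated like this:

```c
// Hint the compiler to unroll the fixed-trip-count copy loop.
#pragma unroll 8
for (i = 0; i < 1024; i++)
    msg[i] = in[offset + i];
```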
Using shared memory on nVidia chips could also help greatly (if your local memory is not already mapped to shared memory by default).
Upvotes: 0
Reputation: 8036
You could try the vector version (uchar8 instead of uchar), but the compiler may already optimize it this way anyway. Most importantly, profile your code all the time and experiment.
edit
Seems even uchar16 is supported now: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/vectorDataTypes.html
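Assuming each group's chunk is 16-byte aligned (offset a multiple of 16), the copy loops could use vload16/vstore16 to move 16 bytes per iteration (a sketch):

```c
__kernel void test_vec(__global unsigned char *in, unsigned int offset) {
    size_t base = get_group_id(0) * (size_t)offset;
    uchar16 msg[64];  // 64 * 16 = 1024 bytes of private memory

    // Copy from global to private memory, 16 bytes at a time.
    // vload16(i, p) reads from p + i * 16.
    for (size_t i = 0; i < 64; i++)
        msg[i] = vload16(i, in + base);

    // ... computation ...

    // Copy back, 16 bytes at a time.
    for (size_t i = 0; i < 64; i++)
        vstore16(msg[i], i, in + base);
}
```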
Upvotes: 1