for-loop interruption slows down OpenCL kernel

Question

Learning OpenCL implementing program which finds 8 most similar pixels within 10 pixels radius for every pixel:

//-DBEST=8

__kernel void best_pixel_matches(
    __global const uint * image,
    const ushort width,
    const ushort height,
    // [ width ] [ height ] [ BEST ] [ 2 ]
    __global short * best,
    __global int * errors
)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    const uint pixel = image[ y * width + x ];

    const short SURROUND = 10;

    //local best pixels coordinates:
    short s[ BEST ][ 2 ];
    //... corresponding local best pixels computed deltas (storage instead of recompute speeds program about twice): 
    uint d[ BEST ];
    //... init with none:
    for ( short i = 0; i < BEST; ++ i ) d[ i ] = -1;

    for ( short sw = - SURROUND; sw < SURROUND; ++ sw ) {
    for ( short sh = - SURROUND; sh < SURROUND; ++ sh ) {
        //avoid inexistent pixels:
        const short tw = x + sw; if ( tw < 0 || tw >= width  ) continue;
        const short th = y + sh; if ( th < 0 || th >= height ) continue;
        //not with itself:
        if ( tw == x && th == y ) continue;
        const uint diff = (uint) abs( (long)pixel - (long)image[ th * width + tw ] );

        for ( short si = 0; si < BEST; ++ si ) {
            if ( d[si] == -1 || diff < d[si] ) {
                d[si] = diff;
                s[si][0] = tw;
                s[si][1] = th;
                break;//<-- this line causes 4 (!) times slow down
            }
        }
    } }

    //copy results from private memory to global:
    const long p = ( x * height + y ) * BEST;
    for ( short b = 0; b < BEST; ++ b ) {
        const long pb = ( p + b ) * 2;
        best[ pb     ] = s[b][0];
        best[ pb + 1 ] = s[b][1];
    }
}

The problem was that it took GPU 3593ms for 2560*1440 image which is almost exactly the same time as plain C++ CPU code I had before (~8500ms). Then I tried to tune it here and there and occasionally removed break; line and execution time -> 900ms!

What's the reason behind such surprising speed up? That break; line just tells program it's not needed to check any other pixels further so supposed to decrease execution time - not slow it down? Maybe there are some other ways to speed up this program too? :)

huseyin tugrul buyukisik · Accepted Answer

Try to maintain memory access order, just before global writing

 // here
 barrier(CLK_GLOBAL_MEM_FENCE);

 //copy results from private memory to global:
 const long p = ( x * height + y ) * BEST;

if your performance problem is originating from memory access pattern, this should ease it.

If problem comes from just branching between threads, then you can also try maintaining order within the second loop after innermost loop:

for ( short si = 0; si < BEST; ++ si ) {
            if ( d[si] == -1 || diff < d[si] ) {
                d[si] = diff;
                s[si][0] = tw;
                s[si][1] = th;
                break;
            }
        }
barrier(CLK_LOCAL_MEM_FENCE); // local => should be faster for many devices

so that in every iteration, all neighboring threads rejoin together and continue executing same instructions until a new break breaks the order.

Lastly, algorithm seems like sorting pixel value differences between SURROUND pixels, by decreasing values. So, the innermost loop could actually be moved to outermost and have all the branching be minimized, just after innermost two new loops(which were outermost before). But this increases image[ th * width + tw ] reading by BEST times so it can be slower too(maybe not when data moved to local memory before this). But, d does not need to be an array now so BEST number of private registers are saved and could give it some performance boost by decreasing register pressure.

for-loop interruption slows down OpenCL kernel

Answers (2)

Related Questions