Slaus
Slaus

Reputation: 2236

for-loop interruption slows down OpenCL kernel

Learning OpenCL implementing program which finds 8 most similar pixels within 10 pixels radius for every pixel:

//-DBEST=8

__kernel void best_pixel_matches(
    __global const uint * image,
    const ushort width,
    const ushort height,
    // [ width ] [ height ] [ BEST ] [ 2 ]
    __global short * best,
    __global int * errors
)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);

    const uint pixel = image[ y * width + x ];

    const short SURROUND = 10;

    //local best pixels coordinates:
    short s[ BEST ][ 2 ];
    //... corresponding local best pixels computed deltas (storage instead of recompute speeds program about twice): 
    uint d[ BEST ];
    //... init with none:
    for ( short i = 0; i < BEST; ++ i ) d[ i ] = -1;

    for ( short sw = - SURROUND; sw < SURROUND; ++ sw ) {
    for ( short sh = - SURROUND; sh < SURROUND; ++ sh ) {
        //avoid inexistent pixels:
        const short tw = x + sw; if ( tw < 0 || tw >= width  ) continue;
        const short th = y + sh; if ( th < 0 || th >= height ) continue;
        //not with itself:
        if ( tw == x && th == y ) continue;
        const uint diff = (uint) abs( (long)pixel - (long)image[ th * width + tw ] );

        for ( short si = 0; si < BEST; ++ si ) {
            if ( d[si] == -1 || diff < d[si] ) {
                d[si] = diff;
                s[si][0] = tw;
                s[si][1] = th;
                break;//<-- this line causes 4 (!) times slow down
            }
        }
    } }

    //copy results from private memory to global:
    const long p = ( x * height + y ) * BEST;
    for ( short b = 0; b < BEST; ++ b ) {
        const long pb = ( p + b ) * 2;
        best[ pb     ] = s[b][0];
        best[ pb + 1 ] = s[b][1];
    }
}

The problem was that it took GPU 3593ms for 2560*1440 image which is almost exactly the same time as plain C++ CPU code I had before (~8500ms). Then I tried to tune it here and there and occasionally removed break; line and execution time -> 900ms!

What's the reason behind such surprising speed up? That break; line just tells program it's not needed to check any other pixels further so supposed to decrease execution time - not slow it down? Maybe there are some other ways to speed up this program too? :)

Upvotes: 0

Views: 326

Answers (2)

huseyin tugrul buyukisik
huseyin tugrul buyukisik

Reputation: 11926

Try to maintain memory access order, just before global writing

 // here
 barrier(CLK_GLOBAL_MEM_FENCE);

 //copy results from private memory to global:
 const long p = ( x * height + y ) * BEST;

if your performance problem is originating from memory access pattern, this should ease it.

If problem comes from just branching between threads, then you can also try maintaining order within the second loop after innermost loop:

for ( short si = 0; si < BEST; ++ si ) {
            if ( d[si] == -1 || diff < d[si] ) {
                d[si] = diff;
                s[si][0] = tw;
                s[si][1] = th;
                break;
            }
        }
barrier(CLK_LOCAL_MEM_FENCE); // local => should be faster for many devices

so that in every iteration, all neighboring threads rejoin together and continue executing same instructions until a new break breaks the order.

Lastly, algorithm seems like sorting pixel value differences between SURROUND pixels, by decreasing values. So, the innermost loop could actually be moved to outermost and have all the branching be minimized, just after innermost two new loops(which were outermost before). But this increases image[ th * width + tw ] reading by BEST times so it can be slower too(maybe not when data moved to local memory before this). But, d does not need to be an array now so BEST number of private registers are saved and could give it some performance boost by decreasing register pressure.

Upvotes: 1

pmdj
pmdj

Reputation: 23438

You don't say what kind of device are you running this on, but:

  • On the CPU, optimisation includes autovectorising, which typically is very sensitive to branches. So a factor of 4 isn't surprising there; your loop iteration counts are known at build time, so they can be completely unrolled and vectorised if there's no early-out. With the early-out, the code likely needs to be serialised, so you lose the 4x, 8x, or 16x speedup from vectorisation (depending on CPU type) and only make some of that back from the early-out.
  • GPU work-items typically run in lock-step threads, but a factor 4 change seems unlikely if that's happening here; it's typically more the case that the break; won't yield any improvement unless there's a high chance of an entire wavefront taking the shortcut. Some GPU models however prefer SIMD code, so you could be running into a similar situation as described for the CPU.

Note that removing the break will presumably fill all BEST items in d & s with the same value (if I'm reading your code correctly) so it won't produce the same output as with the break;.

Upvotes: 2

Related Questions