Reputation: 2236
Learning OpenCL implementing program which finds 8 most similar pixels within 10 pixels radius for every pixel:
//-DBEST=8
__kernel void best_pixel_matches(
__global const uint * image,
const ushort width,
const ushort height,
// [ width ] [ height ] [ BEST ] [ 2 ]
__global short * best,
__global int * errors
)
{
const int x = get_global_id(0);
const int y = get_global_id(1);
const uint pixel = image[ y * width + x ];
const short SURROUND = 10;
//local best pixels coordinates:
short s[ BEST ][ 2 ];
//... corresponding local best pixels computed deltas (storage instead of recompute speeds program about twice):
uint d[ BEST ];
//... init with none:
for ( short i = 0; i < BEST; ++ i ) d[ i ] = -1;
for ( short sw = - SURROUND; sw < SURROUND; ++ sw ) {
for ( short sh = - SURROUND; sh < SURROUND; ++ sh ) {
//avoid inexistent pixels:
const short tw = x + sw; if ( tw < 0 || tw >= width ) continue;
const short th = y + sh; if ( th < 0 || th >= height ) continue;
//not with itself:
if ( tw == x && th == y ) continue;
const uint diff = (uint) abs( (long)pixel - (long)image[ th * width + tw ] );
for ( short si = 0; si < BEST; ++ si ) {
if ( d[si] == -1 || diff < d[si] ) {
d[si] = diff;
s[si][0] = tw;
s[si][1] = th;
break;//<-- this line causes 4 (!) times slow down
}
}
} }
//copy results from private memory to global:
const long p = ( x * height + y ) * BEST;
for ( short b = 0; b < BEST; ++ b ) {
const long pb = ( p + b ) * 2;
best[ pb ] = s[b][0];
best[ pb + 1 ] = s[b][1];
}
}
The problem was that it took GPU 3593ms for 2560*1440 image which is almost exactly the same time as plain C++
CPU code I had before (~8500ms). Then I tried to tune it here and there and occasionally removed break;
line and execution time -> 900ms!
What's the reason behind such surprising speed up? That break;
line just tells program it's not needed to check any other pixels further so supposed to decrease execution time - not slow it down? Maybe there are some other ways to speed up this program too? :)
Upvotes: 0
Views: 326
Reputation: 11926
Try to maintain memory access order, just before global writing
// here
barrier(CLK_GLOBAL_MEM_FENCE);
//copy results from private memory to global:
const long p = ( x * height + y ) * BEST;
if your performance problem is originating from memory access pattern, this should ease it.
If problem comes from just branching between threads, then you can also try maintaining order within the second loop after innermost loop:
for ( short si = 0; si < BEST; ++ si ) {
if ( d[si] == -1 || diff < d[si] ) {
d[si] = diff;
s[si][0] = tw;
s[si][1] = th;
break;
}
}
barrier(CLK_LOCAL_MEM_FENCE); // local => should be faster for many devices
so that in every iteration, all neighboring threads rejoin together and continue executing same instructions until a new break breaks the order.
Lastly, algorithm seems like sorting pixel value differences between SURROUND pixels, by decreasing values. So, the innermost loop could actually be moved to outermost and have all the branching be minimized, just after innermost two new loops(which were outermost before). But this increases image[ th * width + tw ]
reading by BEST times so it can be slower too(maybe not when data moved to local memory before this). But, d does not need to be an array now so BEST number of private registers are saved and could give it some performance boost by decreasing register pressure.
Upvotes: 1
Reputation: 23438
You don't say what kind of device are you running this on, but:
break;
won't yield any improvement unless there's a high chance of an entire wavefront taking the shortcut. Some GPU models however prefer SIMD code, so you could be running into a similar situation as described for the CPU.Note that removing the break
will presumably fill all BEST items in d
& s
with the same value (if I'm reading your code correctly) so it won't produce the same output as with the break;
.
Upvotes: 2