Reputation: 439
How does one implement these within OpenCL kernel? Is 'return;' equivalent to 'break;'?
I am using openCL 1.2
I want to implement this with 3 nested for loops that cycle through a typedef struct of nested arrays.
EDIT
Realised I need to show some code to be better understood
IN KERNEL...
typedef struct tag_sfextras
{
float *high;
float *low;
}sfextras;
typedef struct tag_sdirection
{
int time;
float result;
sfextras *fextras;
}sdirection;
__kernel void Call(sdirection *_direction,
int _index,
int _start,
int _stop,
__global float *_result)
{
float _sum = 0.0f;
if (_index > 1)
{
_result[0] = 0.0f;
int i = get_global_id(0);
if (_direction[i].time >= _stop)
{
break;//or return?...
}
if (_direction[i].time < _start)
{
continue;// what to put here?...
}
else
{
_start = _direction[i].time + (1440 * 60);
}
int d = get_global_id(1);
int f = get_global_id(2);
float _fextras_weight = 0.0f;// need to zeroize on each inner loop (for f)
_fextras_weight += (float)pow(_direction[_index - 1].fextras[d].high[f] - _direction[i].fextras[d].high[f], 2.0f);
_fextras_weight += (float)pow(_direction[_index - 1].fextras[d].low[f] - _direction[i].fextras[d].low[f], 2.0f);
_result[0] += _fextras_weight*_direction[i].result;
_sum += _fextras_weight;
}
if (_sum > 0.0f)
{
_result[0] /= _sum;
}
}
IN HOST (the code I am trying to replicate in kernel for efficiency)
if(_direction_index > 1)
{
_fextras = 0.0f;
for(int i=0;i<_direction_index-1;i++)
{
if(_direction[i].time >= _stop)
{
break;
}
if(_direction[i].time < _start)
{
continue;
}
else
{
_direction_start = _direction[i].time + (1440*60);
}
for(int d=0;d<_DIRECTION;d++)
{
for(int f=0;f<_FEXTRAS;f++)
{
float _fextras_weight = 0.0f;
_fextras_weight += (float)pow(_direction[_direction_index-1].fextras[d].high[f]-_direction[i].fextras[d].high[f],2.0f);
_fextras_weight += (float)pow(_direction[_direction_index-1].fextras[d].low[f]-_direction[i].fextras[d].low[f],2.0f);
_fextras += _fextras_weight*_direction[i].result;
_sum += _fextras_weight;
}
}
}
if(_sum > 0.0f)
{
_fextras /= _sum;
}
}
Upvotes: 3
Views: 2730
Reputation: 11920
Canceling all other threads of opencl would make them undefined behaviour because many of them could be in the middle of writing/reading global/local memory and that could also purge the thread running(stop all other/remaining kernels/threads). Probably thats why there is no such thing in opencl.
But you can add an output array that each thread writes its last state. If an element has "return" code, you should check for "after_return" code to ommit those results' calculations results and accept "before_return" ones. This would also need atomic operations at the output stage so becomes slower which is bad.
But you can safely return from individual kernels:
Below code compiled well and early-quitted(ended kernel execution for some threads but not all) by the 'return' on a HD7870 and a R7-240 without error because 'return' is not one of the constraints applied by OpenCL.
__kernel void rarToVideo(__global int * p,__global char * c)
{
...
if (tmp)
{
foo=1;
}
else
{
return;
}
...
}
Used opencl 1.2 headers of c++.
But, if you still need fake-return and a thread doesnt affect other threads' outputs/inputs, then something like this would help:
// beginning phase of this thread
if(globalAtomicElement[0]>=RETURNED)
{
// finished this thread so it doesn't waste ALU/LD-ST/....
// leaves room for other wavefronts at least
outputState[threadId]=NOT_STARTED;
return;
}
...
...
// ending phase of this thread
// localState has information if this thread needed a "return"
// 0=NOT_RETURNED
// 1=RETURNED
// 2=NOT_STARTED
lastResult=atomic_add(globalAtomicElement,localState);
if(lastResult>=RETURNED)
{
outputState[threadId]=AFTER_RETURNED; // you ommit
// this thread's result
// because an other thread
// pretends to stop all
// so this thread wasted cycles but dont worry,
// it would always waste even if you don't use
// a core for GCN 1.0 - GCN 3.0 architectures
// a core always spin within a compute unit if a
// core/shader is working on something.
// polaris architecture will have ability
// to shut down unused cores so that will not be
// a problem of power consumption either.
}
else if(lastResult==NOT_RETURNED && thisThreadReturned)
{
outputState[threadId]=RETURNED; // this is returning
// thread
//(finishing,pretending to stop all)
}
else if(lastResult==NOT_RETURNED && !thisThreadReturned)
{
outputState[threadId]=BEFORE_RETURNED; // you accept this thread's
// results because no thread
// has ever stopped and this
// thread surely computed
//everything before that
}
then in host side, you check/filter-in only results of "BEFORE_RETURNED" and "RETURNED" and elliminate results of "AFTER_RETURNED".
In opencl 2.0, you can try this:
this could save half of threads at least (or 1/4 or 1/8 ... or 1/N) but would be slow since only 2 threads are inefficient.
Upvotes: 2