Reputation: 81
I have some code that is in a similar form to this:
std::queue<unsigned> q;
q.push(array[0]);
while (!q.empty())
{
unsigned index = q.front();
q.pop();
if(index >= someval){
index = index - someval;
q.push(array[index]);
}
else{
//do something
}
}
}
I would like to move this over to an OpenCL kernel, how would I replicate the functionality of the queue in the most efficient way possible.
At the moment I have implemented it with a fixed sized array which monitors the head and count of elements in the queue. I am wondering if there is a more elegant solution.
Thanks
Upvotes: 3
Views: 417
Reputation: 11920
You should check SYCL: https://www.khronos.org/sycl where where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration. If you can't use that and if you don't even have a device for opencl 2.0:
There isn't any equivalent for version 1.2. But in comments, you said:
Not sure that I understand what you mean but this would need to be individual to all threads on the GPU
Then there is no need for inter-thread communication, it should be simple to implement a FIFO one using a circular buffer for cache+memory+compute efficiency, just don't overflow it(64 elements max) and don't underflow(more pops than pushes but easy to implement a bounds check (with performance penalty)):
push
bool push(__private uint * stack, uint value)
{
// pushing from bot, so you can pop it from top later (FIFO)
// circular buffer for top performance
uint bufLen=64;
// zeroth element is counter for newest added element
// first element is oldest element
// circular buffer
uint nextIndex=(stack[0]%bufLen+2); // +2 because of top-bot headers
// if overflows, it overwrites oldest elements one by one
stack[nextIndex]=value;
// if overflows, it still increments
stack[0]++;
// simple and fast
return true;
}
checking if empty
bool empty(__private uint * stack)
{
// tricky if you overflow both
return (stack[0]==stack[1]);
}
front value
uint front(__private uint * stack)
{
uint bufLen=64;
// oldest element value (top)
uint ptr=stack[1]%bufLen+2; // circular adr + 2 header
return stack[ptr];
}
popping
uint pop(__private uint * stack)
{
uint bufLen=64;
uint ptr=stack[1]%bufLen+2;
// pop from top (oldest)
uint returnValue=stack[ptr];
stack[ptr]=0;
// this will be new top ctr for ptr
stack[1]++;
// if underflows, gets garbage, don't underflow
return returnValue;
}
example kernel for benchmarking:
__kernel void queue0(__global uint * heap)
{
int id=get_global_id(0);
__private uint q[100];
for(int i=0;i<256;i++)
q[i]=0;
for(int i=0;i<55;i++)
push(q,i);
for(int i=0;i<40;i++)
pop(q);
for(int i=0;i<20;i++)
push(q,i);
for(int i=0;i<35;i++)
pop(q);
for(int i=0;i<35;i++)
{
push(q,i);
pop(q);
}
push(q,'h');
push(q,'e');
push(q,'l');
push(q,'l');
push(q,'o');
push(q,' ');
push(q,'w');
push(q,'o');
push(q,'r');
push(q,'l');
push(q,'d');
for(int i=0;i<256;i++)
heap[id*256+i]=q[i];
}
output of buffer (which shows thread id = 0 calculations results)
121 110 0 0 0 0 0 0 0 0 0 0 // 121 pushes total, 110 pops total
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
104 101 108 108 111 32 119 111 114 108 100 0
// hello world
more than 200k pushes+pops under 6.35 milliseconds(running kernel for 1024 threads each working on 256 elements but using only 64+2 elements for circular buffer) for a 1-channel 1600MHz ddr3 RAM and Intel HD Graphics 400 with 12 compute units(96 cores total @600 MHz).
If you construct a 64-element circular buffer using 64 x 4-element circular buffers, you can add elements between top and bottom of stack too!
Upvotes: 1