Reputation: 663
I am utilizing OpenCL's enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
//This kernel is the one that is enqueued from the host, with only one work item. This kernel
//could be considered the "master" kernel that controls the logic of when to enqueue tasks
//First, it checks if a condition is met, then it enqueues kernelB
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
}
else
{
//do other things
}
}
kernel void kernelB(args)
{
//Do some stuff
//Only enqueue the next kernel with the first work item. I do this because the things
//occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
//hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
}
kernel void kernelC(args)
{
//Do some stuff. This one in particular is one step in a sorting algorithm
//This kernel will enqueue kernelD if a condition is met, otherwise it will
//return to kernelA
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
}
else if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
kernel void kernelD(args)
{
//Do some stuff
//Finally, if some condition is met, enqueue kernelC again. What this will do is it will
//bounce back and forth between kernelC and kernelD until the condition is
//no longer met. If it isn't met, go back to kernelA
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. So my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning.
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)
Upvotes: 3
Views: 812
Reputation: 23438
I don't know exactly what's going wrong, but I've noticed a few things in the code you posted and this feedback would be too long/hard to read in comments, so here goes - not a definite answer, but an attempt to get a bit closer:
In kernelD
, you have:
//Finally, if some condition is met, enqueue kernelC again.
…
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
This actually enqueues kernelD
itself again, not kernelC
as the comments suggest. The other condition branch enqueues kernelA
.
This could be a typo in the reduced version of your code.
This could again be down to the way you've abridged the code, but I don't quite see how
So my queue should really only reach a max of like 1 or 2 tasks at a time.
can be true. By my reading, all work items of both kernelC
and kernelD
will spawn new tasks; and as there seems to be more than 1 work item in each case, this seems like it could easily spawn a very large number of tasks:
For example, in kernelC
:
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
kernelB
will have created at least 256 work items running kernelC
. Here, work item 0 will (if other requirements
met) spawn 1 task with at least 256 more work items, and 255+ tasks with 1 work-item running kernelA
. kernelD
behaves similarly.
So with a few iterations, you could easily end up with a few thousand tasks for running kernelA
queued. I don't really know what your code does, but it seems like a good idea to check if cutting down these hundreds of kernelA
tasks improves the situation, and whether you can perhaps modify kernelA
so that you just enqueue it once with a range instead of enqueueing a work size of 1 from every work item. (Or something along those lines - perhaps enqueue once per group if that makes more sense. Basically, reduce the number of times enqueue_kernel
gets called.)
enqueue_kernel()
return valueHave you actually checked the return value for enqueue_kernel
? It tells you exactly why it failed, so even if my suggestion above isn't possible, perhaps you can set some global state which will allow kernelA
to restart the calculation once more tasks have drained, if it was interrupted?
Upvotes: 2