Zack Newsham

Reputation: 2982

Running parallel OpenCL kernels

I have been looking into OpenCL for a little while, to see if it will be useful in my context, and while I understand the basics, I'm not sure I understand how to force multiple instances of a kernel to run in parallel.

In my situation, the application I want to run is inherently sequential and takes (in some cases) a very large input (hundreds of MB). However, the application in question has a number of different options/flags that can be set, which in some cases make it faster or slower. My hope is that we can rewrite the application for OpenCL and then execute each option/flag combination in parallel, rather than guessing which set of flags to use.

My question is this: How many kernels can a graphics card run in parallel? Is this something that can be looked at when purchasing? Is it linked to the number of shaders, the memory, or the size of the application/kernel?

Additionally, while the input to the application will be the same, each execution will modify the data in a different way. Would I need to transfer the input data to each kernel separately to allow for this, or can each kernel allocate "local" memory?

Finally, would this even require multiple kernels, or could I use work-items instead? In that case, how do you determine how many work-items can run in parallel?

(reference: http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854?pgno=3)

Upvotes: 0

Views: 4792

Answers (3)

ollj

Reputation: 31

For compatibility with roughly 2008–2015 hardware, it is safest to assume that every GPU can only run one kernel at any moment, and that kernels are compiled at runtime and swapped in and out of a queue to emulate multiple concurrent kernels. This swapping overhead is why large kernels are better than tiny kernels.

Single-kernel compute units are the default. Having the option to run two different, independent kernels at the same time is the exception. Assume that capability is rare, unsupported, or slower.

Of course, two CPUs in one computer can do this, but as of 2016 having two CPUs in one system is still fairly uncommon, and four even rarer.

Some graphics cards may be able to run two kernels in parallel, but it is safer to assume they cannot.

Upvotes: 1

user2746401

Reputation: 3406

  • How many kernels can a graphics card run in parallel?

You can look up how many kernel instances (i.e. the same kernel code with different launch ids) can be run in parallel on a graphics card. This is a function of SIMDs/CUs/shaders/etc. depending on what the GPU vendor likes to call them. It gets a little complicated to get an exact number of how many kernel instances really execute as this depends on the occupancy which depends on the resources the kernel uses, e.g. registers used, local memory used.

If you mean how many kernel dispatches (i.e. different kernel code and cl_kernel objects or different kernel arguments) can be run in parallel, then all the GPUs I know of can only run a single kernel at a time. These kernels may be picked up from multiple command queues but the GPU will only process one at a time. This is why cl_ext_device_fission is not supported on current GPUs - there is no way to "split" the hardware. You can do it yourself in your kernel code, though (see below).

  • Can each kernel allocate "local" memory?

Yup. This is exactly what OpenCL local memory is for. However, it is a limited resource, so it should be thought of as a kernel-controlled cache rather than a heap.
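As an illustrative sketch (the kernel and buffer names here are made up), a kernel can stage a tile of global input into `__local` memory shared by its work-group, with the host setting the local buffer's size via `clSetKernelArg`:

```c
// Illustrative OpenCL C: each work-group copies one tile of `input` into
// local memory so later reads hit the fast on-chip scratchpad instead of
// global memory.
__kernel void process(__global const float *input,
                      __global float *output,
                      __local float *tile)   // size supplied by the host
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = input[gid];            // one element per work-item
    barrier(CLK_LOCAL_MEM_FENCE);      // wait until the whole tile is loaded

    // ... operate on tile[] instead of re-reading global memory ...
    output[gid] = tile[lid] * 2.0f;
}
```

Note that every work-group sees its own instance of `tile`; local memory is not shared across groups.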

  • In which case, how do you determine how many work-items can run in parallel?

Same answer as the first question assuming kernel instances.

  • Would this even require multiple kernels, could I use work-items instead?

You can simulate different kernels running by using an uber-kernel that decides which sub-kernel to run based on work item global id. For example:

void subKernel0( .... )
{
    int gid = get_global_id(0);
    // etc.
}

void subKernel1( .... )
{
    int gid = get_global_id(0) - DISPATCH_SIZE_0;
    // etc.
}

__kernel void uberKernel( .... )
{
    if( get_global_id(0) < DISPATCH_SIZE_0 )
    {
        subKernel0( .... );
    }
    else if( get_global_id(0) < DISPATCH_SIZE_0 + DISPATCH_SIZE_1 )
    {
        subKernel1( .... );
    }
    else if( .... )
    {
        // etc.
    }
}

The usual performance suggestions for making the dispatch sizes multiples of 32/64, etc. also apply here. You'll have to adjust the various other ids (group, local) accordingly.

Upvotes: 1

mfa

Reputation: 5087

Your question seems to pop up from time-to-time in various forums and on SO. The feature you would use to run kernels separately on a hardware level is called device fission. Read more about the extension on this page, or google "cl_ext_device_fission".

This extension has been enabled on CPUs for a long time, but not on GPUs. The very newest graphics hardware might support device fission. You probably need a GPU from at least Q2 2014 or newer, but this will have to be up to you to research.

The way to get kernels to run in parallel using OpenCL software only is to queue them on different command queues on the same device. Some developers say that multiple queues harm performance, but I don't have personal experience with it.
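A minimal host-side sketch of that approach, with error checking omitted (it assumes an existing context `ctx`, device `dev`, and two built kernels `kA` and `kB`, all of which are illustrative names):

```c
// Two in-order queues on the same device; the driver may overlap the
// two kernels if the hardware supports concurrent kernel execution.
cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

clEnqueueNDRangeKernel(q0, kA, 1, NULL, &globalA, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kB, 1, NULL, &globalB, NULL, 0, NULL, NULL);

clFinish(q0);   // block until each queue drains
clFinish(q1);
```

Whether the two kernels actually overlap is up to the driver and hardware; the queues only remove the software-side serialization.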

Upvotes: 3
