user1855952

Reputation: 1575

OpenCL - how to effectively distribute work items to different devices

I'm writing an OpenCL application where I have N work items that I want to distribute to D devices, where N > D, and in turn each device can process the elements of its own work item in parallel, achieving a sort of "double" parallelism.

Here is the code I have written already to try and achieve this.

First I create an event for each of my devices and set them all to complete:

cl_int err;
cl_event *events = new cl_event[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
    events[i] = clCreateUserEvent(context, &err);
    clSetUserEventStatus(events[i], CL_COMPLETE); //returns cl_int, so don't assign it back to events[i]
}

Each device also has its own command queue and its own "instance" of a kernel.

Then I enter my "main loop" for distributing the work items. The code finds the first available device and enqueues the work item on it.

/*---Loop over all available jobs---*/
for(int i = 0; i < numWorkItems; i++)
{   
    WorkItem item = workItems[i];

    bool found = false; //Check for device availability
    int index = -1;     //Index of found device
    while(!found)       //Continuously loop until free device is found.
    {
        for(int j = 0; j < deviceCount; j++) //Total number of CPUs + GPUs
        {
            cl_int status;
            err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);
            if(status == CL_COMPLETE) /*Current device has completed all of its tasks*/
            {
                found = true; //Exit infinite loop
                index = j;    //Choose current device
                break;        //Break out of inner loop
            }
        }
    }

    //Enqueue my kernel
    clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
    clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc, 0, NULL, &events[index]);

    clFlush(queues[index]);
}

And then finally I wrap up by calling clFinish on all my devices:

/*---Wait For Completion---*/
for(int i = 0; i < deviceCount; i++)
{
    clFinish(queues[i]);
}

This approach has a few problems, however:

1) It doesn't distribute the work to all my devices. On my current computer I have 3 devices. My algorithm above only distributes the work to devices 1 and 2. Device 3 always gets left out because devices 1 and 2 finish so quickly that they can snatch up more work items before 3 gets a chance.

2) Even with devices 1 and 2 running together, I only see a very mild speed increase. For instance, if I assign all work items to device 1 it might take 10 seconds to complete, and if I assign them all to device 2 it might take 11 seconds, but if I split the work between them it takes around 8-9 seconds combined, when what I would hope for is something closer to 4-5 seconds. I get the feeling that they might not really be running in parallel with each other the way I want.

How do I fix these issues?

Upvotes: 1

Views: 1320

Answers (1)

DarkZeros

Reputation: 8410

You have to be careful with the kernel sizes and the memory locations; these factors are typically overlooked when dealing with GPU devices. I would ask you:

  • What are the kernel sizes?
  • How fast do they finish?

    • If the kernel size is small and the kernels finish quite quickly, the overhead of launching them will be high, and the finer granularity of distributing them across many devices does not make up for that extra overhead. In that case it is better to directly increase the work size and use 1 device only.
  • Are the kernels independent? Do they use different buffers?

    • Another important thing is to have completely separate memory for each device; otherwise the memory thrashing between devices will delay the kernel launches, and in that case 1 single device (holding all the memory buffers locally) will perform better. See the sketch after this list.
    • OpenCL will copy all the buffers a kernel uses to the device, and it will "block" all the kernels (even on other devices) that use a buffer the kernel is writing to; it will wait for that kernel to finish and then copy the buffer back to the other device.
  • Is the host a bottleneck?

    • The host is sometimes not as fast as you may think, and sometimes the kernels run so fast that the host becomes a big bottleneck scheduling jobs to them.
    • If you use the CPU as a CL device, it cannot do both tasks at once (act as the host and run kernels). You should always prefer GPU devices over CPU devices when scheduling kernels.
  • Never let a device go idle

    • Waiting until a device has finished its execution before queuing more work is typically a very bad idea. You should queue kernels preemptively (1 or 2 in advance), even before the current kernel has finished. Otherwise the device utilization will not even reach 80%, since there is a big amount of time between the kernel finishing and the host realizing it, and an even bigger amount of time until the host queues more data to the device (typically >2ms; for a 10ms kernel, that's 33% wasted).
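
For the memory point, here is a minimal sketch of giving each device its own buffer (context, bytes and hostData are placeholder names for this illustration, not something from your code):

cl_mem *deviceBuffers = new cl_mem[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
    //One buffer per device, so the runtime never has to migrate a single buffer back and forth between devices
    deviceBuffers[i] = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);

    //Each device receives its own copy of the input through its own command queue
    clEnqueueWriteBuffer(queues[i], deviceBuffers[i], CL_TRUE, 0, bytes, hostData, 0, NULL, NULL);
}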

I would do:

  1. Change the availability check so that a device is accepted as soon as its last job has at least been submitted: if(status <= CL_SUBMITTED). Note that the execution status decreases as a command progresses (CL_QUEUED > CL_SUBMITTED > CL_RUNNING > CL_COMPLETE), so "at least submitted" means <= CL_SUBMITTED. See the first sketch after this list.

  2. Ensure the devices are ordered GPU -> CPU, so that the GPUs are devices 0 and 1 and the CPU is device 2. See the second sketch after this list.

  3. Try removing the CPU device (using only the GPUs). Maybe the speed is even better.
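
For point 1, the inner availability check would then look roughly like this (a sketch only, reusing the events array and err from your code):

for(int j = 0; j < deviceCount; j++)
{
    cl_int status;
    err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);

    //Accept the device if its last job is submitted, running or complete,
    //i.e. anything except still sitting queued on the host side
    if(err == CL_SUCCESS && status <= CL_SUBMITTED)
    {
        found = true;
        index = j;
        break;
    }
}

For point 2, a sketch of building the device list GPU-first (assuming a single platform variable; adapt it to however you currently enumerate devices):

cl_uint numGPU = 0, numCPU = 0;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numGPU);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 0, NULL, &numCPU);

cl_device_id *devices = new cl_device_id[numGPU + numCPU];

//GPUs first (devices 0..numGPU-1), CPU(s) last
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numGPU, devices, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, numCPU, devices + numGPU, NULL);

//Build queues[i], kernels[i] and events[i] in this same order, so the
//"first available device" scan always tries the GPUs before the CPU.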

Upvotes: 3
