Reputation: 1575
I'm writing an OpenCL application in which I have N work items to distribute across D devices, where N > D. Each device can in turn process the elements of its own work item in parallel, which gives a sort of "double" parallelism.
Here is the code I have written so far to try to achieve this.
First I create an event for each of my devices and set them all to complete:
cl_int err;
cl_event *events = new cl_event[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
    events[i] = clCreateUserEvent(context, &err);
    err = clSetUserEventStatus(events[i], CL_COMPLETE); //Returns a cl_int error code, not an event
}
Each device also has its own command queue and its own "instance" of a kernel.
Then I enter into my "main loop" for distributing work items. The code finds the first available device and enqueues it with the work item.
/*---Loop over all available jobs---*/
for(int i = 0; i < numWorkItems; i++)
{
    WorkItem item = workItems[i];
    bool found = false; //Check for device availability
    int index = -1; //Index of found device
    while(!found) //Continuously loop until free device is found.
    {
        for(int j = 0; j < deviceCount; j++) //Total number of CPUs + GPUs
        {
            cl_int status;
            err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);
            if(status == CL_COMPLETE) /*Current device has completed all of its tasks*/
            {
                found = true; //Exit infinite loop
                index = j; //Choose current device
                break; //Break out of inner loop
            }
        }
    }
    //Enqueue my kernel
    clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
    clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc, 0, NULL, &events[index]);
    clFlush(queues[index]);
}
And then finally I wrap up by calling clFinish on all my devices:
/*---Wait For Completion---*/
for(int i = 0; i < deviceCount; i++)
{
    clFinish(queues[i]);
}
However, this approach has a few problems:
1) It doesn't distribute the work to all my devices. On my current computer I have 3 devices. My algorithm above only distributes the work to devices 1 and 2. Device 3 always gets left out because devices 1 and 2 finish so quickly that they can snatch up more work items before 3 gets a chance.
2) Even with devices 1 and 2 running together, I only see a very mild speed increase. For instance, if I assign all work items to device 1 it might take 10 seconds to complete, and if I assign all work items to device 2 it might take 11 seconds, but if I split the work between them it still takes 8-9 seconds, when I would hope for something between 4 and 5 seconds. I get the feeling that they are not really running in parallel with each other the way I want.
How do I fix these issues?
Upvotes: 1
Views: 1320
Reputation: 8410
You have to be careful with the work sizes and with where the memory is located. These factors are often overlooked when dealing with GPU devices. I would ask you:
How fast do they finish?
Are the kernels independent? Do they use different buffers?
Is the host a bottleneck?
Never let a device sit idle.
I would do the following:
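Change the availability check so that it also accepts devices whose last job is only submitted or running, not just completed: if(status <= CL_SUBMITTED). The status constants are ordered CL_COMPLETE (0), CL_RUNNING (1), CL_SUBMITTED (2), CL_QUEUED (3), so <= CL_SUBMITTED matches a device whose last job is complete, running, or submitted. Negative values indicate an error and should be handled separately.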
Ensure the devices are ordered GPU -> CPU. That is, the GPUs are devices 0 and 1 and the CPU is device 2.
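The two suggestions above can be sketched without any OpenCL dependency. This is only an illustration under stated assumptions: the enum mirrors the execution-status values from cl.h (CL_COMPLETE = 0, CL_RUNNING = 1, CL_SUBMITTED = 2, CL_QUEUED = 3), and `DeviceSlot`, `canAcceptWork`, `orderGpuFirst` and `pickDevice` are hypothetical names, not part of the OpenCL API. Because CL_COMPLETE is the smallest value, "the last job has at least been submitted" is numerically a `<=` comparison against CL_SUBMITTED.

```cpp
#include <algorithm>
#include <vector>

// Local stand-ins mirroring the OpenCL execution-status values in cl.h.
// Real host code would include <CL/cl.h> and use CL_COMPLETE etc. directly.
enum ExecStatus : int { kComplete = 0, kRunning = 1, kSubmitted = 2, kQueued = 3 };

// Hypothetical per-device bookkeeping for this sketch.
struct DeviceSlot {
    bool isGpu;      // true for a GPU, false for a CPU
    int lastStatus;  // execution status of the most recently enqueued job
};

// A device may take another job once its previous job is complete, running,
// or submitted. Negative statuses are errors and are rejected here.
bool canAcceptWork(int status) {
    return status >= kComplete && status <= kSubmitted;
}

// Put GPUs before CPUs while keeping the relative order within each group.
void orderGpuFirst(std::vector<DeviceSlot>& devices) {
    std::stable_sort(devices.begin(), devices.end(),
                     [](const DeviceSlot& a, const DeviceSlot& b) {
                         return a.isGpu && !b.isGpu;
                     });
}

// Index of the first device that can accept work, or -1 if none can.
int pickDevice(const std::vector<DeviceSlot>& devices) {
    for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
        if (canAcceptWork(devices[i].lastStatus)) {
            return i;
        }
    }
    return -1;
}
```

In the question's loop, `canAcceptWork(status)` would take the place of the `status == CL_COMPLETE` test, so each queue can already hold its next job while the current one is still running.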
Upvotes: 3