Reputation:
I have achieved a small code for doing sum reduction of a 1D array. I am comparing a CPU sequential version and a OpenCL version.
The code is available on this link1
The kernel code is available on this link2
and if you want to compile : link3 for Makefile
My issue is about the bad performances of GPU version :
for size of vector lower than 1,024 * 10^9 elements (i.e with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements
) the runtime for GPU version is higher (slightly higher but higher) than CPU one.
As you can see, I have taken 2^n values in order to have a compatible number of workitems with the size of a workgroup.
Concerning the number of workgroups, I have taken :
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of workitems, I wonder if the value of nWorkGroups is suitable ( for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 workgroups
, isn't this too much ?? ).
I tried to modify loca_item_size
in the range of [64, 128, 256, 512, 1024]
but the performances remain bad for all these values.
I have good benefits only for size = 1.024 * 10^9
elements, here are the runtimes :
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experiences, why do I get bad performances ? I though that advantages should be more significative compared to CPU version.
Maybe someone could see into source code a main mistake because, at the moment, I can't get to solve this issue.
Thanks
Upvotes: 0
Views: 302
Reputation: 8410
Well I can tell you some reasons:
You don't need to write the reduction buffer. You can directly clear it in GPU memory using clEnqueueFillBuffer()
or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
Dont use blocking calls, except for the last read. Otherwise you are wasting some time there.
You are doing the last reduction in CPU. Iterative processing trough the kernel can help.
Because if your kernel is just reducing 128 elements per pass. Your 10^9 number just gets down to 8*10^6. And the CPU does the rest. If you add there the data copy, it makes it completely non worth. However, if you run 3 passes at 512 elements per pass, you read out from the GPU just 10^9/512^3 = 8 values. So, the only bottleneck would be the first GPU copy and the kernel launch.
Upvotes: 1