Reputation: 39
I'm using OpenCL to optimize some code on the Raspberry Pi GPU (VideoCore IV). I'm using the VC4CL implementation, which offers a maximum work group size of 12.
However, with simple kernels like summing two arrays, the GPU's performance is much worse than the CPU's.
For example, for the following kernel:
#define GLOBAL_SIZE 12
#define LOCAL_SIZE 1
#define WIDTH 12*12*12
#define NTIMES 1000
__attribute__ ((reqd_work_group_size(LOCAL_SIZE, LOCAL_SIZE, LOCAL_SIZE)))
__kernel void int_sum(global const uint* A, global const uint* B, global uint* C)
{
    int g_id0 = get_global_id(0);
    int g_id1 = get_global_id(1);
    int g_id2 = get_global_id(2);
    int index = g_id0 + g_id1 * GLOBAL_SIZE + g_id2 * GLOBAL_SIZE * GLOBAL_SIZE;
    for(int k = 0; k < NTIMES; k++)
        C[index + k * WIDTH] = A[index + k * WIDTH] + B[index + k * WIDTH];
}
where two arrays of ~1.7e6 elements (12*12*12*1000) are summed, the CPU's performance is much better... I've tried changing the work group to one dimension, and also other combinations like 6x6x6 global size with 2x2x2 local size. Any hint as to what I might be doing wrong?
Thanks in advance.
Upvotes: 3
Views: 505
Reputation: 1119
Other than what everyone else has said in the comments: according to the author of the RPi's OpenCL implementation, the GPU's memory access speed (CPU-GPU memory copies?) is much slower than the CPU's. So an "arithmetically light" kernel like an array sum will be limited by memory bandwidth, and can be much slower than on the CPU. Plus, the GPU's theoretical GFLOPS figure isn't that much higher than the CPU's (24 vs 6).
Unless you have some very computationally heavy kernels which can also be fully vectorized, you might find that using the GPU is simply not worth it.
Upvotes: 1
Reputation: 23428
I'm not familiar with this particular GPU, but a few things that stand out as possible red flags in your code:
First, you recompute

C[index + k * WIDTH] = A[index + k * WIDTH] + B[index + k * WIDTH];

with a fresh multiply for the offset on each loop iteration. I'd keep the offset in a variable and add WIDTH to it on each iteration, no multiply required.

More fundamentally, your kernel is exactly equivalent to C[i] = A[i] + B[i] for i = 0..1000*12*12*12. How about writing your kernel as exactly that and submitting 1728000 work items in one dimension? That saves on all the complicated index calculations. If you can get any kind of feedback from the drivers on what the GPU is bound by (ALU, memory loads, thread scheduling, etc.), that's going to help a lot with choosing where to look for ways to speed it up.
Upvotes: 1