André Aguiar

Reputation: 39

OpenCL performance issue on GPU

I'm using OpenCL to optimize some code in Raspberry Pi GPU (Videocore IV). I'm using VC4CL implementation which offers a maximum work group size of 12.

However, with simple kernels like summing two arrays, the GPU's performance is much worse than the CPU's.

For example, for the following kernel:

#define GLOBAL_SIZE 12
#define LOCAL_SIZE  1
#define WIDTH       12*12*12
#define NTIMES      1000

__attribute__ ((reqd_work_group_size(LOCAL_SIZE, LOCAL_SIZE, LOCAL_SIZE)))
__kernel void int_sum(global const uint* A, global const uint* B, global uint* C)
{
  int g_id0 = get_global_id(0);
  int g_id1 = get_global_id(1);
  int g_id2 = get_global_id(2);

  int index = g_id0 + g_id1 * GLOBAL_SIZE + g_id2 * GLOBAL_SIZE * GLOBAL_SIZE;

  for(int k = 0; k < NTIMES; k++)
    C[index + k * WIDTH] = A[index + k * WIDTH] + B[index + k * WIDTH];
}

where two arrays of roughly 1.7 million elements (12 × 12 × 12 × 1000) are summed, the CPU's performance is much better... I've tried changing the work group to one dimension and also using other combinations like (6x6x6 -> global size, 2x2x2 -> local size). Any hint as to what I might be doing wrong?

Thanks in advance.

Upvotes: 3

Views: 505

Answers (2)

mogu

Reputation: 1119

Other than what everyone else has said in the comments: according to the author of the RPi's OpenCL implementation, the GPU's "memory access speed" (the CPU-GPU memory copy?) is much slower than the CPU's. So an "arithmetically light" kernel like an array sum will be limited by memory bandwidth and can be much slower than on the CPU. On top of that, the GPU's theoretical GFLOPS figure isn't that much higher than the CPU's (24 vs 6).

Unless you have some very computationally heavy kernels which can also be fully vectorized, you might find using the GPU is simply not worth it.

Upvotes: 1

pmdj

Reputation: 23428

I'm not familiar with this particular GPU, but a few things that stand out as possible red flags in your code:

  • This is integer ALU heavy code rather than using floating-point operations. Many GPUs are not optimised for this at all.
  • I wouldn't rely on the compiler optimising the array offset calculations; a particularly stupid compiler might emit 3 integer multiplications for C[index + k * WIDTH] = A[index + k * WIDTH] + B[index + k * WIDTH]; on each loop iteration. I'd keep the offset in a variable and add to it on each iteration, no multiply required.
  • The 1000-iteration for loop generally stands out as a potential source of better parallelism. Many GPUs perform badly with long-running kernels.
  • The memory access patterns seem sub-optimal. Try to arrange things such that adjacent work items in a group access adjacent memory locations. 2x2x2 local size seems like a particularly poor choice for that. Have you tried 12x1x1?
  • Why are you even arranging the work items in this way? It looks like you're literally calculating C[i] = A[i] + B[i] for i = 0..1000*12*12*12. How about writing your kernel as exactly that and submitting 1728000 work items in one dimension? That saves on all the complicated index calculations.

If you can get any kind of feedback from the drivers what the GPU is bound by (ALU, memory loads, thread scheduling, etc.) that's going to help a lot with choosing where to look for ways to speed it up.

Upvotes: 1
