saman

Reputation: 311

Performance gap between two almost identical OpenCL kernels

I have two almost identical OpenCL kernels whose performance I want to measure in GFLOPS. Kernel #1 is:

__kernel void Test41(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp;
    int gid = get_global_id(0);

    temp = data[gid];
    temp = (float) rands[1] * temp;
    temp = (float) rands[2] * temp;
    temp = (float) rands[3] * temp;
    temp = (float) rands[4] * temp;
    // ... repeated for rands[5] through rands[496] ...
    temp = (float) rands[497] * temp;
    temp = (float) rands[498] * temp;
    temp = (float) rands[499] * temp;
    data[gid] = temp.s0;

}

The second kernel is:

__kernel void Test42(__global float *data, __global float *rands, int index, int rand_max){

    float16 temp[500];
    int gid = get_global_id(0);

    temp[0] = data[gid];
    temp[1] = (float) rands[1] * temp[0];
    temp[2] = (float) rands[2] * temp[1];
    temp[3] = (float) rands[3] * temp[2];
    temp[4] = (float) rands[4] * temp[3];
    // ... repeated: temp[i] = (float) rands[i] * temp[i-1] for i = 5 ... 496 ...
    temp[497] = (float) rands[497] * temp[496];
    temp[498] = (float) rands[498] * temp[497];
    temp[499] = (float) rands[499] * temp[498];
    data[gid] = temp[index].s0;

}

As you can see in the code, I'm using a stream (vector) size of 16. Each kernel has 500 lines of operations, and each line performs a single vector multiply. I also launch around 1048576 kernel instances in total, so I have around 1048576 work items executing in parallel.

In order to calculate the flops I do:

flops = #numWorkItems(1048576) * (500) * StreamSize(16) / timeTaken;
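
Plugging in the numbers: 1048576 * 500 * 16 ≈ 8.39e9 floating-point operations per launch (strictly, the kernels contain 499 multiplies, rands[1] through rands[499], but 500 is close enough for an estimate). At 1.4 TFLOPS that corresponds to a kernel time of roughly 6 ms; at 38 GFLOPS, roughly 220 ms.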

Unfortunately, I get around 1.4 TFLOPS for the first kernel but only 38 GFLOPS for the second, and I cannot explain this huge gap. Using an array of temp values instead of a single temp seems to make an enormous difference. It also seems that real applications mostly resemble the second kernel, while the first kernel is too simple to be realistic.

Can anyone help me understand what exactly is happening here, and how the second kernel's performance could be brought up to the first one's? More generally, when benchmarking a device, should I expect to see performance near the theoretical peak?

P.S. I understand I should copy rands into __local memory, but let's skip that for now.
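
For reference, here is a minimal sketch of how timeTaken and the GFLOPS figure might be obtained with OpenCL event profiling. Platform, device, and program setup are assumed to happen elsewhere, the queue must have been created with CL_QUEUE_PROFILING_ENABLE, and the function name is illustrative (error handling omitted):

#include <CL/cl.h>

/* Enqueue the kernel over num_work_items work items and return measured GFLOPS. */
double run_and_measure_gflops(cl_command_queue queue, cl_kernel kernel,
                              size_t num_work_items)
{
    cl_event evt;
    size_t global = num_work_items;   /* 1048576 in the question */

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    double seconds = (end - start) * 1e-9;                  /* timestamps are in ns */
    double flops = (double)num_work_items * 500.0 * 16.0;   /* as in the formula above */
    return flops / seconds / 1e9;
}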

Upvotes: 0

Views: 249

Answers (2)

Elalfer

Reputation: 5358

There are two possible issues:

  • You declared the float16 temp buffer as __private (the default address space in OpenCL), so it will most likely be spilled to global memory, which has quite high access latency. You might try declaring it as __local float16 if it fits in your device's local memory (a rough sketch follows below).
  • Adding the temp buffer created problems for the compiler. The original code is easily vectorizable on some GPU architectures (Intel, for example), and the array's store+load pairs add artificial dependencies.

I'd actually file an issue report against the compiler for this. It should be easy enough for the compiler to work out the dependencies, optimize, and even eliminate your temp buffer.
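
A rough sketch of the first suggestion (the kernel name Test42_local is illustrative, not from the question). Note the footprint: 500 float16 values take 500 * 64 B = 32000 B, which barely fits the 32 KiB of local memory on many GPUs, and a __local array is shared by the whole work group, so this only runs race-free with a work-group size of 1:

__kernel void Test42_local(__global float *data, __global float *rands, int index, int rand_max){

    // the array now lives in local memory instead of __private
    __local float16 temp[500];
    int gid = get_global_id(0);

    temp[0] = data[gid];
    temp[1] = rands[1] * temp[0];
    // ... the remaining multiply lines as in the original ...
    temp[499] = rands[499] * temp[498];
    data[gid] = temp[index].s0;

}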

Upvotes: -1

Ruyk

Reputation: 795

As @pmdj suggested in the comments, the main problem with the second kernel is register pressure: you are using a large number of hardware registers, which reduces the number of work groups that can execute simultaneously. In general, large private arrays are not a good idea in OpenCL/CUDA; there is very little a compiler can do to optimize performance in that case. You could use local memory for the array, but then you need the appropriate synchronisation to access it. One way to avoid the array entirely is sketched below.
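
Since only temp[index] is ever read back, the kernel can be restructured to keep just two float16 values instead of a 500-element private array, which removes the register pressure described above. This is a hypothetical rewrite (the kernel name is illustrative; the loop bound mirrors the question):

__kernel void Test42_lowreg(__global float *data, __global float *rands, int index, int rand_max){

    int gid = get_global_id(0);
    float16 temp   = data[gid];   // running product, broadcast to all 16 lanes
    float16 result = temp;        // covers the index == 0 case

    for (int i = 1; i < 500; i++) {
        temp = rands[i] * temp;
        if (i == index)
            result = temp;        // capture the value Test42 would return
    }
    data[gid] = result.s0;

}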

Upvotes: 0
