Se Norm

Reputation: 1755

OpenCL very low GFLOPS, no data transfer bottleneck

I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. It achieves ~20 single-precision GFLOPS, but according to the GPU's specs I should be getting ~1500 GFLOPS.

To find the bottleneck I created a very simple kernel:

kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    float cd;

    for (int i=0; i<100000; i++)
    {
        cd = d*i;
    }

    if (cd == -1.0f)
    {
        result[gid] = cd;
    }
}

Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.

==> 5*10^5 work_items/sec * 10^5 FLOP per work item = 5*10^10 FLOP/sec = 50 GFLOPS.
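
The estimate above can be written as a small host-side helper. This is only a sketch of the arithmetic: `estimate_gflops` is a hypothetical name, and the count of operations per loop iteration is an assumption you supply, not something measured.

```c
/* Estimated throughput in GFLOPS.
 * work_items_per_sec: measured work-items completed per second.
 * iterations:         loop iterations per work-item (100000 in the kernel above).
 * flop_per_iteration: assumed floating point operations per iteration. */
double estimate_gflops(double work_items_per_sec,
                       double iterations,
                       double flop_per_iteration)
{
    return work_items_per_sec * iterations * flop_per_iteration / 1e9;
}
```

With the numbers from this question, `estimate_gflops(5e5, 1e5, 1.0)` gives 50, and counting 4 operations per iteration gives 200, both well below the ~1500 GFLOPS in the specs.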

Even if there are 3 or 4 operations going on in the loop, it's much slower than what the card should be able to do. What am I doing wrong?

The global work size is big enough (no speed change for 10k vs 100k work items).

Upvotes: 1

Views: 551

Answers (1)

Roman Arzumanyan

Reputation: 1814

Here are a couple of tricks:

  1. GPUs don't like loops at all. Use #pragma unroll to unroll them.
  2. Your GPU is good at vector operations. Stick to them; that lets you process multiple operands at once.
  3. Use vector load/store wherever possible.
  4. Measure the memory bandwidth. I'm almost sure that you are bandwidth-limited because of a poor access pattern.

In my opinion, the kernel should look like this:

typedef union floats{
    float16 vector;
    float array[16];
} floats;

kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    floats cd;
    // Load 16 consecutive floats starting at result[16 * gid].
    cd.vector = vload16(gid, result);
    cd.vector *= d;

    #pragma unroll
    for (int i=0; i<16; i++)
    {
        if (cd.array[i] == -1.0f)
        {
            // Store the matching element, not the whole union.
            result[gid] = cd.array[i];
        }
    }
}

Make your NDRange bigger to compensate for the difference between 16 and 100000 iterations in the loop condition.
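
As a rough sketch of how much bigger the NDRange has to be, the helper below keeps the total number of multiplications constant when the work per work-item changes. `scaled_global_size` is a hypothetical name, and it assumes one multiplication per loop iteration in the original kernel and 16 per work-item in the vectorized one.

```c
#include <stddef.h>

/* Global work size needed so that new_ops_per_item operations per work-item
 * add up to the same total as old_global_size * old_ops_per_item.
 * The result is rounded up to a whole number of work-items. */
size_t scaled_global_size(size_t old_global_size,
                          size_t old_ops_per_item,
                          size_t new_ops_per_item)
{
    size_t total_ops = old_global_size * old_ops_per_item;
    return (total_ops + new_ops_per_item - 1) / new_ops_per_item;
}
```

For 10k work-items with 100000 iterations each, `scaled_global_size(10000, 100000, 16)` comes to 62500000 work-items. Since `vload16(gid, result)` reads 16 floats starting at `result[16 * gid]`, the buffer also has to hold 16 floats per work-item.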

Upvotes: 2
