OpenCL very low GFLOPS, no data transfer bottleneck

Question

I am trying to optimize an algorithm that runs on my GPU (AMD HD6850). I counted the floating-point operations inside my kernel and measured its execution time. It achieves ~20 SP GFLOPS, but according to the GPU's specs I should get ~1500 GFLOPS.

To find the bottleneck I created a very simple kernel:

kernel void test_gflops(const float d, global float* result)
{
    int gid = get_global_id(0);
    float cd;

    for (int i=0; i<100000; i++)
    {
        cd = d*i;
    }

    if (cd == -1.0f)
    {
        result[gid] = cd;
    }
}

Running this kernel I get ~5*10^5 work_items/sec. I count one floating-point operation per loop iteration (I'm not sure that's right: what about incrementing i and comparing it to 100000?).

==> 5*10^5 work_items/sec * 10^5 FLOP/work_item = 5*10^10 FLOP/sec = 50 GFLOPS.

Even if there are 3 or 4 operations going on in the loop, it's much slower than what the card should be able to do. What am I doing wrong?

The global work size is big enough (no speed change for 10k vs 100k work items).
