Reputation: 1755
I am trying to optimize an algorithm I am running on my GPU (AMD HD6850). I counted the number of floating point operations inside my kernel and measured its execution time. I found it to achieve ~20 SP GFLOPS, however according to the GPUs specs I should achieve ~1500 GFLOPS.
To find the bottleneck I created a very simple kernel:
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
float cd;
for (int i=0; i<100000; i++)
{
cd = d*i;
}
if (cd == -1.0f)
{
result[gid] = cd;
}
}
Running this kernel I get ~5*10^5 work_items/sec. I count one floating point operation (not sure if that's right, what about incrementing i and comparing it to 100000?) per iteration of the loop.
==> 5*10^5 work_items/sec * 10^5 FLOPS = 50 GFLOPS.
Even if there are 3 or 4 operations going on in the loop, it's much slower than the what the card should be able to do. What am I doing wrong?
The global work size is big enough (no speed change for 10k vs 100k work items).
Upvotes: 1
Views: 551
Reputation: 1814
Here are a couple of tricks:
In my opinion, kernel should look like this:
typedef union floats{
float16 vector;
float array[16];
} floats;
kernel void test_gflops(const float d, global float* result)
{
int gid = get_global_id(0);
floats cd;
cd.vector = vload16(gid, result);
cd.vector *= d;
#pragma unroll
for (int i=0; i<16; i++)
{
if(cd.array[i] == -1.0f){
result[gid] = cd;
}
}
Make your NDRange bigger to compensate difference between 16 & 1000 in loop condition.
Upvotes: 2