Reputation: 311
I have two nearly identical OpenCL kernels whose performance I want to measure in GFLOPS. Kernel #1 is:
__kernel void Test41(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp;
    int gid = get_global_id(0);
    temp = data[gid];
    temp = (float) rands[1] * temp;
    temp = (float) rands[2] * temp;
    temp = (float) rands[3] * temp;
    temp = (float) rands[4] * temp;
    // ...
    temp = (float) rands[497] * temp;
    temp = (float) rands[498] * temp;
    temp = (float) rands[499] * temp;
    data[gid] = temp.s0;
}
The second kernel is:
__kernel void Test42(__global float *data, __global float *rands, int index, int rand_max){
    float16 temp[500];
    int gid = get_global_id(0);
    temp[0] = data[gid];
    temp[1] = (float) rands[1] * temp[0];
    temp[2] = (float) rands[2] * temp[1];
    temp[3] = (float) rands[3] * temp[2];
    temp[4] = (float) rands[4] * temp[3];
    // ...
    temp[497] = (float) rands[497] * temp[496];
    temp[498] = (float) rands[498] * temp[497];
    temp[499] = (float) rands[499] * temp[498];
    data[gid] = temp[index].s0;
}
As you can see in the code, I'm using a stream size of 16. Each kernel has 500 lines of operations, and each line does a single floating-point operation per vector lane. I also launch around 1048576 work items in total, so I will have around 1048576 work items executing in parallel.
In order to calculate the flops I do:
flops = #numWorkItems(1048576) * (500) * StreamSize(16) / timeTaken;
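For concreteness, a small host-side sketch of that calculation (the kernel time below is just a placeholder, not one of my measurements) would be:
#include <stdio.h>

int main(void) {
    const double numWorkItems = 1048576.0;   /* global work size           */
    const double numOps       = 500.0;       /* multiply lines per kernel  */
    const double streamSize   = 16.0;        /* lanes in a float16         */
    const double timeTaken    = 0.006;       /* seconds; placeholder value */

    double flops = numWorkItems * numOps * streamSize / timeTaken;
    printf("%.1f GFLOPS\n", flops / 1e9);    /* ~1398 GFLOPS for 6 ms      */
    return 0;
}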
Unfortunately, for the first kernel I get around 1.4 TFLOPS, but for the second kernel I get only 38 GFLOPS. I have not been able to explain this huge gap. Using an array of temp values instead of a single temp variable seems to make a huge difference. It also seems that real applications are mostly like the second kernel; the first kernel is too simple for a real application.
Can anyone help me understand what exactly is happening here, and how the second kernel's performance can be brought up to the first one's? In general, if I'm going to benchmark my device, should I expect to see performance near the theoretical peak?
P.S. I understand I need to copy rands into __local memory, but let's skip that for now.
Upvotes: 0
Views: 249
Reputation: 5358
There are two possible issues:
- The float16 temp buffer is __private (which is the default in OpenCL) and will most likely be allocated in the global memory space, with quite high access latency. You might try declaring it as __local float16 if it fits into your device's local memory.
- Adding the temp buffer created some problems for the compiler. The original code is easily vectorizable on some GPU architectures (Intel, for example), and you added artificial dependencies by adding a store + load for every operation.
I'd actually submit an issue report on the compiler for doing that. It should be easy enough for the compiler to figure out the dependencies, do the optimizations and even get rid of your temp buffer.
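For illustration, here is a hypothetical Test42_reduced kernel sketching what such an optimization could reduce the second kernel to, assuming index always selects the last element temp[499] (with an arbitrary runtime index the transformation is less straightforward):
__kernel void Test42_reduced(__global float *data, __global float *rands, int index, int rand_max){
    // Signature kept for comparison; index and rand_max are unused here.
    int gid = get_global_id(0);
    float16 temp = data[gid];        // scalar broadcast into all 16 lanes, as in the original
    for (int i = 1; i < 500; i++)    // same multiply chain, but no private array
        temp = rands[i] * temp;
    data[gid] = temp.s0;
}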
Upvotes: -1
Reputation: 795
As @pmdj has suggested in the comments, the main problem with the second kernel is register pressure: you are using a large number of hardware registers, which reduces the number of work groups that can execute simultaneously. In general, large private arrays are not a good idea in OpenCL/CUDA; there is very little a compiler can do to optimize performance in that case. You could use local memory for the array, but then you need to add the appropriate synchronisation to access it.
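For example, here is a minimal, hypothetical sketch of the local-memory idea applied to the read-only rands coefficients (the per-work-item temp array itself, at 500 * 64 bytes, would not fit in local memory), using a barrier for the required synchronisation:
__kernel void Test41_local(__global float *data, __global float *rands, int index, int rand_max){
    __local float lrands[500];           // 500 * 4 bytes = 2 KB shared per work group
    int lid = (int) get_local_id(0);
    int lsz = (int) get_local_size(0);
    // Cooperatively copy the coefficients from global into local memory.
    for (int i = lid; i < 500; i += lsz)
        lrands[i] = rands[i];
    barrier(CLK_LOCAL_MEM_FENCE);        // wait until every work item sees the full copy
    int gid = get_global_id(0);
    float16 temp = data[gid];            // scalar broadcast into all 16 lanes
    for (int i = 1; i < 500; i++)
        temp = lrands[i] * temp;
    data[gid] = temp.s0;
}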
Upvotes: 0