Reputation: 2444
I want to make sure that measuring the time of clEnqueueReadBuffer() is the proper way to measure the time required to copy data from the GPU to main memory. I'm a bit suspicious that clFinish() contributes as well.
I run an element-wise product of 1000000 floats 100 times and measure the time like this:
// RUN TIME BLOCK
println( " Running OpenCL program ... " );
t1 = System.nanoTime();
for (int reps = 0; reps < 100; reps++) {
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, null, global_work_size, local_work_size, 0, null, null);
}
clFinish(commandQueue); // VERSION 1
t2 = System.nanoTime();
println( " OpenCL Run Time : "+ ((t2-t1)*1e-9)+" [s] " );
// READ OUT TIME BLOCK
t1 = System.nanoTime();
//clFinish(commandQueue); // VERSION 2
clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0, n * Sizeof.cl_float, dst, 0, null, null);
t2 = System.nanoTime();
println( " Read-out Time: "+ ((t2-t1)*1e-9)+" [s] " );
I got these results:
OpenCL Run Time : 2.5124469 [s]
Read-out Time: 0.002145424 [s]
This seems a bit too good to me: just 2 milliseconds, less than 0.1% of the total time. The CPU-GPU communication bottleneck does not seem to be so bad.
When I put clFinish(commandQueue); inside the Read-out Time block instead, I got these results:
OpenCL Run Time : 1.0892084 [s]
Read-out Time: 1.4300439 [s]
Which, on the other hand, seems too bad: is it really faster to do 100 multiplications on the GPU than to copy the result over PCI Express? Well, maybe.
Just for completeness: I used the openclp5 library for Processing, which uses JOCL, with Java JDK 1.7 on Ubuntu 12.04 64-bit and a Quadro FX 580 GPU. My kernel is simple (no optimization or anything):
String programSource =
"__kernel void sampleKernel( "+
"__global const float *a, __global const float *b, __global float *c) { "+
" int gid = get_global_id(0); "+
" c[gid] = a[gid] * b[gid]; "+
"}";
Upvotes: 0
Views: 285
Reputation: 9925
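Your first method of measuring the time to read data back to the host is correct. The second approach would include some computation time: clEnqueueNDRangeKernel() only enqueues the work and can return before the kernels have finished, so without a clFinish() before the timer starts, the blocking read has to wait for the remaining kernels.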
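To see why the position of the synchronization point matters, here is a rough analogy that runs without a GPU. It is not OpenCL: a single-threaded ExecutorService stands in for the in-order command queue, Thread.sleep() stands in for a kernel, and Future.get() plays the role of clFinish() or a blocking read. The class name AsyncTimingSketch and the method measure() are made up for this illustration, and the lambda syntax assumes Java 8 or later.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncTimingSketch {
    // Returns {enqueueSeconds, waitSeconds} for `reps` fake kernels of `taskMillis` each.
    static double[] measure(int reps, long taskMillis) throws Exception {
        // One worker thread runs tasks in submission order, like an in-order queue.
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            long t0 = System.nanoTime();
            Future<?> last = null;
            for (int i = 0; i < reps; i++) {
                // submit() returns immediately; the fake kernel runs later on the worker.
                last = queue.submit(() -> {
                    try {
                        Thread.sleep(taskMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            long t1 = System.nanoTime();
            // Blocking on the last task stands in for clFinish() or a blocking read.
            if (last != null) {
                last.get();
            }
            long t2 = System.nanoTime();
            return new double[] { (t1 - t0) * 1e-9, (t2 - t1) * 1e-9 };
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] t = measure(4, 50);
        System.out.println("Enqueue time: " + t[0] + " [s]");
        System.out.println("Wait time:    " + t[1] + " [s]");
    }
}
```

Almost all of the elapsed time shows up in the wait, not in the enqueue loop. Whichever timed block contains the first blocking call is charged for the outstanding work, which is what happens to the read-out block when clFinish() is moved into it.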
The amount of data you are reading is 1000000 * sizeof(float) = 4 MB. If this is taking 2 ms, then you are achieving a bandwidth of 4 MB / 0.002 s = 2 GB/s. Why do you think this is too good to be true? Your card supports PCIe x16, which has a theoretical peak bandwidth of 8 GB/s (in one direction).
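The same arithmetic as a small Java sketch, in case you want to plug in other buffer sizes or timings. The class name ReadBandwidth and the helper gigabytesPerSecond() are made up for this example, and 1 GB is taken as 1e9 bytes. Float.BYTES assumes Java 8 or later.

```java
public class ReadBandwidth {
    // Effective bandwidth in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `seconds`.
    static double gigabytesPerSecond(long bytes, double seconds) {
        return bytes / seconds / 1e9;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;           // number of floats, from the question
        long bytes = n * Float.BYTES;  // 4 bytes per float
        // Rounded figure used in the estimate (2 ms)
        System.out.println(gigabytesPerSecond(bytes, 0.002) + " GB/s");
        // Read-out time actually reported in the question
        System.out.println(gigabytesPerSecond(bytes, 0.002145424) + " GB/s");
    }
}
```

With the measured 0.002145424 s, this comes out at roughly 1.86 GB/s, still well under the theoretical peak.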
Upvotes: 1