Reputation: 8118
I want to know how to calculate the effective memory bandwidth for:
cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);
That function is declared in cublas_v2.h.
The call runs in 0.46 ms, and each vector is 10000 * sizeof(float) bytes.
Is the effective bandwidth ((10000 * 4) / 10^9) / 0.00046 = 0.086 GB/s?
I'm asking because I don't know what happens inside the cublasSdot function, and I don't know whether knowing that is necessary.
Upvotes: 1
Views: 761
Reputation: 504
If kernel time is in ms then a multiplication factor of 1000 is necessary. That results in 86 GB/s.
As an example, refer to the Matrix Transpose sample provided by NVIDIA at http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
The entire code is on the last page. There, the effective bandwidth is computed as 2.*1000*mem_size/(1024*1024*1024)/(time in ms)
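To make the units in that formula explicit, here is a minimal sketch in plain C. The function name transpose_bandwidth_gib_s is my own and is not taken from the NVIDIA sample; I am assuming mem_size is in bytes, and the factor of 2 counts one read plus one write of the matrix. Note that dividing by 1024*1024*1024 gives GiB/s, not GB/s.

```c
#include <stddef.h>

/* Effective bandwidth in GiB/s, following the formula quoted above:
   2 * 1000 * mem_size / (1024*1024*1024) / time_ms.
   The name and signature are hypothetical, for illustration only. */
double transpose_bandwidth_gib_s(size_t mem_size_bytes, double time_ms)
{
    /* The matrix is read once and written once. */
    const double bytes_moved = 2.0 * (double)mem_size_bytes;

    /* The factor of 1000 in the formula converts ms to seconds. */
    const double seconds = time_ms / 1000.0;

    return bytes_moved / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

For instance, a 1 GiB matrix transposed in 1000 ms gives 2 GiB/s with this function.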
Upvotes: 0
Reputation: 9779
In your case, the size of the input data is 10000 * 4 * 2 bytes, since you have 2 input vectors, and the size of the output data is 4 bytes. The effective bandwidth is therefore (80004 / 10^9) / 0.00046, or about 0.174 GB/s.
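The byte counting above can be sketched in plain C as follows. The helper names sdot_bytes_moved and effective_bandwidth_gb_s are made up for illustration and are not part of cuBLAS; the sketch assumes a 4-byte float and uses 1 GB = 10^9 bytes, as in the question.

```c
#include <stddef.h>

/* Bytes moved by a single-precision dot product of n elements:
   two input vectors are read and one float result is written.
   Hypothetical helper, not a cuBLAS function. */
size_t sdot_bytes_moved(size_t n)
{
    return 2 * n * sizeof(float) + sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a time given in ms. */
double effective_bandwidth_gb_s(size_t bytes, double time_ms)
{
    const double seconds = time_ms / 1000.0; /* ms -> s */
    return (double)bytes / 1e9 / seconds;
}
```

Calling effective_bandwidth_gb_s(sdot_bytes_moved(10000), 0.46) counts 80004 bytes and returns roughly 0.17 GB/s.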
Basically, cublasSdot() does little more than the computation itself.
Profiling shows that cublasSdot() invokes 2 kernels to compute the result. An extra 4-byte device-to-host memory transfer is also performed if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for the cuBLAS library.
Upvotes: 3