Reputation: 2724
These are my results from running CUBLAS DGEMM on 4 GPUs (Tesla M2050), using 2 streams per GPU:
I have verified my results and they are correct; however, I am concerned about the high Gflops value I am getting compared with the version that uses the default stream. I am calculating the Gflops using the formula:
Gflops = 2.0e-9 * (N^3 + N^2) / elapsed_time_in_s
For the version that uses multiple streams, do I need to modify this formula in any way?
Here HtoD-ker-DtoH is the time, in seconds, taken for the host-to-device transfer, the kernel execution, and the device-to-host transfer (this is the denominator in the formula above).
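For reference, here is the same formula as a small helper function (a sketch only; n and elapsed_s stand in for the matrix dimension and the measured HtoD-ker-DtoH time):

    /* GFLOP/s for one N x N DGEMM, using the formula above. */
    double dgemm_gflops(int n, double elapsed_s)
    {
        double nd = (double)n;  /* avoid integer overflow for large n */
        return 2.0e-9 * (nd * nd * nd + nd * nd) / elapsed_s;
    }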
Cross-posted to the NVIDIA forums: http://forums.nvidia.com/index.php?showtopic=219910&st=0#entry1350908
EDIT: Following the comment of @talonmies, I added a cudaStreamSynchronize on each stream before calculating the time; the change is roughly the sketch below, and the updated results are as follows:
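A minimal sketch of that change, assuming per-GPU stream arrays (ngpus and streams are placeholder names for my actual handles):

    #include <cuda_runtime.h>

    /* Block on every stream of every GPU before reading the host timer,
       so the elapsed time covers HtoD, DGEMM and DtoH on all devices. */
    void sync_all_streams(int ngpus, cudaStream_t streams[][2])
    {
        for (int d = 0; d < ngpus; d++) {
            cudaSetDevice(d);               /* streams belong to their device */
            for (int s = 0; s < 2; s++)     /* 2 streams per GPU */
                cudaStreamSynchronize(streams[d][s]);
        }
    }

I call this right before stopping the host timer.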
Thanks,
Sayan
Upvotes: 2
Views: 1791
Reputation: 72349
A single C2050 gives about 550 GFLOP/s peak, or about 2200 GFLOP/s for 4 (that is peak for double precision, and DGEMM is considerably lower than peak), so I would guess that your timing is wrong in the streams case (probably something that was synchronous in the default stream case is now asynchronous). The FLOP/s calculation should not change no matter how you do the computations.
I would review your code to ensure that whatever timing mechanism you use is synchronized with all of the streams you launch, either via the cudaStreamWaitEvent mechanism across all streams or via cudaStreamSynchronize per stream. It is likely that your timing code is finishing before the GPU has completed the CUBLAS operations.
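For example, the cudaStreamWaitEvent version on one device might look something like this sketch (the stream array, count, and pre-recorded start event are placeholders, not your actual code):

    #include <cuda_runtime.h>

    /* Make streams[0] wait until every other stream has drained, then
       record the stop event, so the elapsed time spans all streams.
       Assumes `start` was recorded in streams[0] before work was queued. */
    float elapsed_ms_all_streams(cudaStream_t *streams, int nstreams,
                                 cudaEvent_t start, cudaEvent_t stop)
    {
        for (int i = 1; i < nstreams; i++) {
            cudaEvent_t done;
            cudaEventCreate(&done);
            cudaEventRecord(done, streams[i]);        /* marks end of stream i */
            cudaStreamWaitEvent(streams[0], done, 0); /* stream 0 waits for it */
            cudaEventDestroy(done);                   /* released once complete */
        }
        cudaEventRecord(stop, streams[0]); /* fires once all streams are done */
        cudaEventSynchronize(stop);        /* host blocks until then */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        return ms;
    }

Either way, the point is the same: the stop timestamp must not be taken until every stream has finished its work.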
Upvotes: 3