Reputation: 667
I have implemented an algorithm in CUDA and it seems to be running faster with double precision than with single precision.
I know that single precision is usually faster on the GPU. My GPU is an Nvidia GeForce GT 650M.
The pseudocode of the algorithm is the following:
for k = 1 to numIterations
    for j = 1 to numRowsOfAMatrix
        cudaMemset(double arrayGPU)
        dot product(double arrayGPU, double arrayGPU)              [using cublasDdot]
        dot product(double arrayGPU, double arrayGPU)              [using cublasDdot]
        scalar-vector multiplication(scalarCPU, double arrayGPU)   [using cublasDaxpy]
        vector sum(double arrayGPU, double arrayGPU)               [using cublasDaxpy]
    end
end
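Written out with actual cuBLAS calls, one pass of the inner loop looks roughly like the sketch below (names are placeholders, and since the pseudocode does not say how the CPU scalar is produced from the two dot products, the division here is purely illustrative):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* One inner-loop iteration; d_x, d_y, d_z are device arrays of length n. */
    void inner_step(cublasHandle_t handle, int n,
                    double *d_x, double *d_y, double *d_z)
    {
        double dot1, dot2, alpha, one = 1.0;

        cudaMemset(d_y, 0, n * sizeof(double));          /* memset step            */
        cublasDdot(handle, n, d_x, 1, d_z, 1, &dot1);    /* blocks: result to host */
        cublasDdot(handle, n, d_x, 1, d_x, 1, &dot2);    /* blocks: result to host */
        alpha = dot1 / dot2;                             /* scalar stays on CPU    */
        cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  /* scalar-vector multiply */
        cublasDaxpy(handle, n, &one, d_z, 1, d_y, 1);    /* vector sum             */
    }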
I've run some tests with the following properties: the arrays are 2500 elements long and the matrix has 2700 rows.
The times that I'm obtaining are the following:
Iterations | Single (s) | Double (s)
        50 |    20.9960 |    20.1881
       200 |    81.9562 |    78.9490
       500 |    199.661 |    199.045
      1000 |    413.129 |    396.205
Any idea why double precision is faster?
Upvotes: 1
Views: 2616
Reputation: 21515
The difference in computational cost between two algorithms (in your case, the single and double precision versions) is generally measured by the asymptotic computational complexity. It is not surprising that double precision can have the same performance as single precision for a fixed (small, in your case) vector length, for the reasons explained by talonmies (latency). To really state which algorithm is faster, you should analyze the timing against the vector length N, starting from small to large values of N.
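A minimal way to do that (a sketch, not code from the question: the choice of lengths, the zero-filled data, and the event-based timing are my own) is to time one call at a series of lengths, then repeat with cublasSdot and float arrays for the single precision curve:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Time a double precision dot product at increasing vector lengths. */
    int main(void)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        for (int n = 1 << 10; n <= 1 << 24; n <<= 2) {
            double *d_x, result;
            cudaMalloc(&d_x, n * sizeof(double));
            cudaMemset(d_x, 0, n * sizeof(double));

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            cublasDdot(handle, n, d_x, 1, d_x, 1, &result);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("N = %8d : %.3f ms\n", n, ms);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            cudaFree(d_x);
        }
        cublasDestroy(handle);
        return 0;
    }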
Another example, which however has nothing to do with GPGPU, is the FFT: it has an asymptotic complexity of O(N log N) and is therefore more convenient than "brute-force" summation of the DFT, which has O(N^2) complexity. But if you compare the timings of the FFT and the "brute-force" DFT summation for very small values of N, you will find that the "brute-force" summation takes less time.
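Concretely, the direct summation evaluates, for each of the N outputs,

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1,$$

i.e. on the order of N^2 complex multiply-adds in total, versus roughly (N/2) log2 N butterfly operations for a radix-2 FFT; for small N the FFT's constant factors and bookkeeping dominate, which is where the crossover comes from.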
Upvotes: 0
Reputation: 72372
I don't believe you can say that the double precision version is faster than the single precision version. Your own timing shows both take about 20 seconds for 50 iterations and about 200 seconds for 500 iterations. The question then becomes why?
To me it just looks like your code is dominated by API and PCI-e bus latency. Even the factor-of-two memory bandwidth difference between single and double precision is probably irrelevant in this case. If each array is only about 2500 entries long, then the arithmetic and device memory transaction portions of the calculation will be absolutely tiny compared to the overall execution time.
Looking at your pseudocode shows why. At each iteration, the two dot calls each have to launch one or more kernels, wait for them to finish, then download a scalar result from the device. Then scalars have to be uploaded to the device for each axpy call, followed by a kernel launch. From the information in the comments, this means your code performs perhaps two blocking memory copies and six kernel launches per input row, and there are 2700 input rows per iteration. That means your code is performing 10-15 thousand GPU API calls per iteration, which is a lot of transactions and API latency (especially if you are doing this on a WDDM Windows platform) for nothing more than a few thousand FLOPs and a few tens of kB of GPU memory access per row.
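As an aside (not something the question's code does, and the names here are placeholders), most of those blocking round trips can be eliminated by switching cuBLAS to device pointer mode, so the dot result stays in GPU memory and feeds cublasDaxpy directly:

    #include <cublas_v2.h>

    /* Sketch: keep the dot result in GPU memory so neither call blocks.
       d_x, d_y are device arrays of length n; d_alpha is a single double
       allocated on the device to receive the dot result. */
    void fused_step(cublasHandle_t handle, int n,
                    const double *d_x, double *d_y, double *d_alpha)
    {
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasDdot(handle, n, d_x, 1, d_x, 1, d_alpha);   /* async: result -> d_alpha */
        cublasDaxpy(handle, n, d_alpha, d_x, 1, d_y, 1);  /* alpha read on the device */
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    }

Whether that helps here depends on whether the CPU scalar in the pseudocode really is just a dot result; if host-side logic needs the value, the download is unavoidable.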
The fact that your GPU has 12 times higher peak single precision than double precision arithmetic throughput is irrelevant in this case, because the computation time is a vanishingly small fraction of the total wall clock time you measure.
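To put numbers on that (a rough estimate using the figures from the question: two dots and two axpys of length N = 2500 per row, i.e. about 8N FLOPs per row):

$$8 \times 2500 \times 2700 \;\text{rows} \times 1000 \;\text{iterations} \approx 5.4 \times 10^{10} \;\text{FLOPs},$$

which over the roughly 400 measured seconds is about 0.14 GFLOP/s, orders of magnitude below the card's peak in either precision.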
Upvotes: 6