Reputation: 667
I have implemented an algorithm in CUDA and it seems to be running faster with double precision than with single precision.
I know that single precision is usually faster on the GPU. My GPU is an Nvidia GeForce GT 650M.
The pseudocode of the algorithm is the following:
for k = 1 to numIterations
    for j = 1 to numRowsOfAMatrix
        cudaMemset(double arrayGPU)
        dot product(double arrayGPU, double arrayGPU)              [using cublasDdot]
        dot product(double arrayGPU, double arrayGPU)              [using cublasDdot]
        scalar-vector multiplication(scalarCPU, double arrayGPU)   [using cublasDaxpy]
        vector sum(double arrayGPU, double arrayGPU)               [using cublasDaxpy]
    end
end
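Written out with actual cuBLAS calls, one pass of the inner loop looks roughly like the sketch below (names are placeholders, and since the pseudocode does not say how the CPU scalar is produced from the two dot products, the division here is purely illustrative):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* One inner-loop iteration; d_x, d_y, d_z are device arrays of length n. */
    void inner_step(cublasHandle_t handle, int n,
                    double *d_x, double *d_y, double *d_z)
    {
        double dot1, dot2, alpha, one = 1.0;

        cudaMemset(d_y, 0, n * sizeof(double));          /* memset step            */
        cublasDdot(handle, n, d_x, 1, d_z, 1, &dot1);    /* blocks: result to host */
        cublasDdot(handle, n, d_x, 1, d_x, 1, &dot2);    /* blocks: result to host */
        alpha = dot1 / dot2;                             /* scalar stays on CPU    */
        cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  /* scalar-vector multiply */
        cublasDaxpy(handle, n, &one, d_z, 1, d_y, 1);    /* vector sum             */
    }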
I've run some tests with the following properties: the arrays are 2500 elements long and the matrix has 2700 rows.
The times that I'm obtaining are the following:
Iterations | Single (s) | Double (s)
        50 |    20.9960 |    20.1881
       200 |    81.9562 |    78.9490
       500 |    199.661 |    199.045
      1000 |    413.129 |    396.205
Any idea why double precision is faster?
Upvotes: 1
Views: 2616
Reputation: 21515
The difference in computational cost between two algorithms (in your case, the single and double precision versions) is generally measured by the asymptotic computational complexity. It is not surprising that double precision can have the same performance as single precision for a fixed (small, in your case) vector length, for the reasons explained by talonmies (latency). To really state which algorithm is faster, you should analyze the timing against the vector length N, starting from small to large values of N.
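A minimal way to do that (a sketch, not code from the question: the choice of lengths, the zero-filled data, and the event-based timing are my own) is to time one call at a series of lengths, then repeat with cublasSdot and float arrays for the single precision curve:

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Time a double precision dot product at increasing vector lengths. */
    int main(void)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        for (int n = 1 << 10; n <= 1 << 24; n <<= 2) {
            double *d_x, result;
            cudaMalloc(&d_x, n * sizeof(double));
            cudaMemset(d_x, 0, n * sizeof(double));

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);
            cublasDdot(handle, n, d_x, 1, d_x, 1, &result);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("N = %8d : %.3f ms\n", n, ms);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            cudaFree(d_x);
        }
        cublasDestroy(handle);
        return 0;
    }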
Another example, which however has nothing to do with GPGPU, is the FFT: it has an asymptotic complexity of O(N log N) and is therefore more convenient than "brute-force" summation of the DFT, which has O(N^2) complexity. But if you compare the timings of the FFT and the "brute-force" DFT summation for very small values of N, you will find that the "brute-force" summation takes less time.
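Concretely, the direct summation evaluates, for each of the N outputs,

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1,$$

i.e. on the order of N^2 complex multiply-adds in total, versus roughly (N/2) log2 N butterfly operations for a radix-2 FFT; for small N the FFT's constant factors and bookkeeping dominate, which is where the crossover comes from.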
Upvotes: 0
Reputation: 72372
I don't believe you can say that the double precision version is faster than the single precision version. Your own timing shows both take about 20 seconds for 50 iterations and about 200 seconds for 500 iterations. The question then becomes why?
To me it just looks like your code is dominated by API and PCI-e bus latency. Even the factor-of-two memory bandwidth difference between single and double precision is probably irrelevant in this case. If each array is only about 2500 entries long, then the arithmetic and device memory transaction portions of the calculation will be absolutely tiny compared to the overall execution time.
Looking at your pseudocode shows why. At each iteration, the two dot calls each have to launch one or more kernels, wait for them to finish, then download a scalar result from the device. Then scalars have to be uploaded to the device for each axpy call, followed by a kernel launch. From the information in the comments, this means your code performs perhaps two blocking memory copies and six kernel launches per input row, and there are 2700 input rows per iteration. That means your code is performing 10-15 thousand GPU API calls per iteration, which is a lot of transactions and API latency (especially if you are doing this on a WDDM Windows platform) for nothing more than a few thousand FLOPs and a few tens of kB of GPU memory access per row.
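As an aside (not something the question's code does, and the names here are placeholders), most of those blocking round trips can be eliminated by switching cuBLAS to device pointer mode, so the dot result stays in GPU memory and feeds cublasDaxpy directly:

    #include <cublas_v2.h>

    /* Sketch: keep the dot result in GPU memory so neither call blocks.
       d_x, d_y are device arrays of length n; d_alpha is a single double
       allocated on the device to receive the dot result. */
    void fused_step(cublasHandle_t handle, int n,
                    const double *d_x, double *d_y, double *d_alpha)
    {
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasDdot(handle, n, d_x, 1, d_x, 1, d_alpha);   /* async: result -> d_alpha */
        cublasDaxpy(handle, n, d_alpha, d_x, 1, d_y, 1);  /* alpha read on the device */
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    }

Whether that helps here depends on whether the CPU scalar in the pseudocode really is just a dot result; if host-side logic needs the value, the download is unavoidable.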
The fact that your GPU has 12 times higher peak single precision than double precision arithmetic throughput is irrelevant in this case, because the computation time is a vanishingly small fraction of the total wall clock time you measure.
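To put numbers on that (a rough estimate using the figures from the question: two dots and two axpys of length N = 2500 per row, i.e. about 8N FLOPs per row):

$$8 \times 2500 \times 2700 \;\text{rows} \times 1000 \;\text{iterations} \approx 5.4 \times 10^{10} \;\text{FLOPs},$$

which over the roughly 400 measured seconds is about 0.14 GFLOP/s, orders of magnitude below the card's peak in either precision.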
Upvotes: 6