Reputation: 1663
As mentioned in the title, I found that cudaMallocPitch() takes a lot of time, and cudaMemcpy2D() takes a fair amount of time as well.
Here is the code I am using:
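cudaMallocPitch((void **)(&SrcDst), &DeviceStride, Size.width * sizeof(float), Size.height);
// Note: cudaMallocPitch() returns the pitch in bytes. The "* sizeof(float)" below is only
// correct if DeviceStride has been converted to elements (DeviceStride /= sizeof(float)) first.
cudaMemcpy2D(SrcDst, DeviceStride * sizeof(float),
             ImgF1, StrideF * sizeof(float),
             Size.width * sizeof(float), Size.height,
             cudaMemcpyHostToDevice);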
In my implementation, Size.width and Size.height are both 4800. cudaMallocPitch() takes about 150-160 ms (measured over multiple runs to rule out flukes), and cudaMemcpy2D() takes about 50 ms.
It seems unlikely that the memory bandwidth between the CPU and GPU is this limited, but I cannot see any errors in the code. What is the reason?
By the way, the hardware I am using is an Intel i7-4770K CPU and an Nvidia GeForce GTX 780 (fairly capable hardware that otherwise runs without errors).
Upvotes: 0
Views: 3058
Reputation: 151944
There are many factors here which may be impacting performance.
Regarding cudaMallocPitch, if it happens to be the first CUDA call in your program, it will incur additional overhead, because the first call also pays for initializing the CUDA runtime and context.
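One way to check whether start-up cost is what you are measuring is to force initialization before the allocation and time the two steps separately. The following is only a sketch under assumptions: the helper `time_ms` and the variable names are made up for illustration, and `cudaFree(0)` is used as a common idiom to trigger context creation, not as something taken from your code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a callable, in milliseconds.
template <typename F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t width = 4800, height = 4800;  // sizes from the question

    // Force CUDA runtime/context initialization so that its cost is
    // measured separately from the allocation that follows.
    double init_ms = time_ms([] { cudaFree(0); });

    float* d_img = nullptr;
    size_t pitch_bytes = 0;  // cudaMallocPitch reports the pitch in bytes
    cudaError_t err = cudaSuccess;
    double alloc_ms = time_ms([&] {
        err = cudaMallocPitch((void**)&d_img, &pitch_bytes,
                              width * sizeof(float), height);
    });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocPitch failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("initialization: %.2f ms\n", init_ms);
    std::printf("cudaMallocPitch: %.2f ms (pitch = %zu bytes)\n",
                alloc_ms, pitch_bytes);

    cudaFree(d_img);
    return 0;
}
```

If most of the 150-160 ms moves into the first number, the allocation itself was never the slow part.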
Regarding cudaMemcpy2D, this is accomplished under the hood via a sequence of individual memcpy operations, one per row of your 2D area (i.e. 4800 individual DMA operations). This will necessarily incur additional overhead compared to an ordinary cudaMemcpy operation (which transfers the entire data area in a single DMA transfer). Furthermore, peak transfer speeds are only achieved when the host-side memory buffer is pinned. Finally, you don't indicate anything about your platform. If you are on Windows, then WDDM will interfere with full transfer performance for this operation, and we don't know what kind of PCIe link you are on.
4800*4800*4 bytes / 0.050 s = 1.84 GB/s, which is a significant fraction of the ~3 GB/s that is roughly available for a non-pinned transfer across PCIe 2.0. The reduction from 3 GB/s to 1.84 GB/s is easily explainable by the other factors I list above.
If you want full transfer performance, use pinned memory and don't use a pitched/2D transfer.
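A minimal sketch of that suggestion follows, assuming the image can be stored contiguously on both sides (no row padding). The names `h_img`, `d_img` and the `CHECK` macro are made up for illustration. The host buffer is allocated with cudaMallocHost so that it is pinned, and the whole image goes across in one cudaMemcpy, timed with CUDA events.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

int main() {
    const size_t width = 4800, height = 4800;  // sizes from the question
    const size_t bytes = width * height * sizeof(float);

    float* h_img = nullptr;
    float* d_img = nullptr;
    CHECK(cudaMallocHost((void**)&h_img, bytes));  // pinned host memory
    CHECK(cudaMalloc((void**)&d_img, bytes));      // linear, unpitched device memory

    // Fill the host buffer so the copy moves initialized data.
    for (size_t i = 0; i < width * height; ++i) h_img[i] = 0.0f;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("copied %zu bytes in %.2f ms (%.2f GB/s)\n",
                bytes, ms, bytes / (ms * 1e-3) / 1e9);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_img));
    CHECK(cudaFreeHost(h_img));
    return 0;
}
```

The trade-off is that kernels then index rows with a stride of width elements instead of a pitch, so you give up the row alignment that cudaMallocPitch provides. Whether that matters depends on your access pattern.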
Upvotes: 3