qiuhan1989

Reputation: 1663

In CUDA, why do cudaMemcpy2D and cudaMallocPitch take so much time?

As the title says, I found that cudaMallocPitch() takes a lot of time, and cudaMemcpy2D() takes quite some time as well.

Here is the code I am using:

cudaMallocPitch((void **)(&SrcDst), &DeviceStride, Size.width * sizeof(float), Size.height);

cudaMemcpy2D(SrcDst, DeviceStride * sizeof(float), 
        ImgF1, StrideF * sizeof(float), 
        Size.width * sizeof(float), Size.height,
        cudaMemcpyHostToDevice);

In my implementation, Size.width and Size.height are both 4800. cudaMallocPitch() takes about 150-160 ms (measured over multiple runs to rule out outliers) and cudaMemcpy2D() takes about 50 ms.

It does not seem plausible that the memory bandwidth between the CPU and GPU is this limited, but I cannot see any errors in the code. What is the reason?

By the way, the hardware I am using is an Intel i7-4770K CPU and an Nvidia GeForce GTX 780 (fairly good hardware, with no known faults).

Upvotes: 0

Views: 3058

Answers (1)

Robert Crovella

Reputation: 151944

There are many factors here which may be impacting performance.

Regarding cudaMallocPitch, if it happens to be the first CUDA call in your program, it will incur additional overhead, because the first runtime call also pays for initializing the CUDA context.
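One way to check whether first-call overhead explains the 150-160 ms is to time a throwaway runtime call before the allocation. The following is a minimal sketch, not the asker's code: the names `width`, `height`, `d_img`, `pitch_bytes` and the helper `ms_between` are made up for illustration, and `cudaFree(0)` is used only as a common idiom for forcing initialization.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Milliseconds elapsed between two host clock readings (hypothetical helper).
static double ms_between(std::chrono::steady_clock::time_point a,
                         std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main() {
    const size_t width = 4800, height = 4800;  // image size from the question

    // Warm-up: cudaFree(0) frees nothing, but it forces the runtime to
    // initialize, so that cost is not charged to the allocation below.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float *d_img = nullptr;
    size_t pitch_bytes = 0;  // cudaMallocPitch reports the pitch in bytes
    err = cudaMallocPitch((void **)&d_img, &pitch_bytes,
                          width * sizeof(float), height);
    auto t2 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first CUDA call (warm-up): %.2f ms\n", ms_between(t0, t1));
    printf("cudaMallocPitch afterwards: %.2f ms (pitch = %zu bytes)\n",
           ms_between(t1, t2), pitch_bytes);

    cudaFree(d_img);
    return 0;
}
```

If most of the time shows up on the warm-up line rather than on the allocation, the cost belongs to start-up and not to cudaMallocPitch itself.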

Regarding cudaMemcpy2D, this is accomplished under the hood via a sequence of individual memcpy operations, one per row of your 2D area (i.e. 4800 individual DMA operations). This will necessarily incur additional overhead compared to an ordinary cudaMemcpy operation (which transfers the entire data area in a single DMA transfer). Furthermore, peak transfer speeds are only achieved when the host-side memory buffer is pinned. Finally, you don't indicate anything about your platform. If you are on Windows, then WDDM will interfere with full transfer performance for this operation, and we don't know what kind of PCIE link you are on.

4800*4800*4 bytes / 0.050 s = 1.84GB/s, which is a significant fraction of the ~3GB/s that is roughly available for a non-pinned transfer across PCIE 2.0. The reduction from 3GB/s to 1.84GB/s is easily explainable by the other factors I list above.

If you want full transfer performance, use pinned memory and don't use a pitched/2D transfer.
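As a rough illustration of that advice, the sketch below allocates the host buffer with cudaMallocHost (pinned), uses a plain cudaMalloc on the device, and times a single flat cudaMemcpy with CUDA events. It assumes the host rows are contiguous (no row padding) and that the kernels can work on an unpitched device buffer; neither is stated in the question. The names `h_img`, `d_img` and the `CHECK` macro are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failing runtime call and exit main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t width = 4800, height = 4800;  // image size from the question
    const size_t bytes = width * height * sizeof(float);

    float *h_img = nullptr;  // pinned (page-locked) host buffer
    float *d_img = nullptr;  // flat, unpitched device buffer
    CHECK(cudaMallocHost((void **)&h_img, bytes));
    CHECK(cudaMalloc((void **)&d_img, bytes));

    // Fill the host buffer with placeholder data.
    for (size_t i = 0; i < width * height; ++i) h_img[i] = 1.0f;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One contiguous host-to-device copy instead of a 2D, row-by-row copy.
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    // bytes / (ms * 1e6) converts bytes per millisecond to GB/s.
    printf("copied %zu bytes in %.2f ms (%.2f GB/s)\n",
           bytes, ms, bytes / (ms * 1.0e6));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_img));
    CHECK(cudaFreeHost(h_img));
    return 0;
}
```

Swapping cudaMallocHost for an ordinary malloc in the same program gives a direct pinned versus pageable comparison on a given machine.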

Upvotes: 3
