In CUDA, why do cudaMemcpy2D and cudaMallocPitch take so much time?

Question

As the title says, I found that cudaMallocPitch() takes a lot of time, and cudaMemcpy2D() takes quite some time as well.

Here is the code I am using:

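// Allocate pitched device memory; DeviceStride receives the row pitch in bytes.
cudaMallocPitch((void **)(&SrcDst), &DeviceStride, Size.width * sizeof(float), Size.height);

// Copy the host image to the device. Both pitch arguments are in bytes, so
// DeviceStride * sizeof(float) is only correct if DeviceStride has been
// converted to an element count in between (that step is not shown here).
cudaMemcpy2D(SrcDst, DeviceStride * sizeof(float),
        ImgF1, StrideF * sizeof(float),
        Size.width * sizeof(float), Size.height,
        cudaMemcpyHostToDevice);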

In my tests, Size.width and Size.height are both 4800. cudaMallocPitch() takes about 150-160 ms (measured over several runs to rule out flukes), and cudaMemcpy2D() takes about 50 ms.

It seems unlikely that the bandwidth between the CPU and GPU is this limited, but I cannot find any errors in the code. What is the reason?
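
To put the copy time in perspective, the effective transfer rate implied by these figures can be worked out directly. The following is a minimal sketch in plain host-side C++; the helper names transfer_bytes and effective_bandwidth_gbps are hypothetical (they are not part of the CUDA API), and it assumes a 4-byte float and counts only the payload, not any pitch padding.

```cpp
#include <cstddef>

// Payload size of a 2D copy of width x height elements (row padding excluded).
// Hypothetical helper for illustration only.
constexpr std::size_t transfer_bytes(std::size_t width,
                                     std::size_t height,
                                     std::size_t elem_size) {
    return width * height * elem_size;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a copy that took
// `ms` milliseconds. Hypothetical helper for illustration only.
inline double effective_bandwidth_gbps(std::size_t bytes, double ms) {
    return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}
```

With width = height = 4800 and 4-byte elements, transfer_bytes gives 92,160,000 bytes (about 92 MB), and a 50 ms copy corresponds to roughly 1.84 GB/s.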

By the way, the hardware I am using is an Intel i7-4770K CPU and an Nvidia GeForce GTX 780 (good hardware, working without errors).
