Proper use of cudaMalloc3D with cudaMemcpy

Question

I'd like to send a 3D array src of size size in each dimension, flattened into a 1D array of size length = size * size * size, into a kernel, compute a result and store it in dst. However, at the end, dst improperly contains all 0s. Here is my code:

int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);

// Allocate source and destination arrays on the host and initialize source array

float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
    src[i] = i;
}

// Allocate source and destination arrays on the device

struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);

// Copy to the device, execute kernel, and copy back to the host

cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);

I've left out my error checking of cudaMallocHost(), cudaMalloc() and cudaMemcpy() for clarity. No error is triggered by this code in any case.

What is the correct use of cudaMalloc3D() with cudaMemcpy()?

Please let me know if I should post a minimal test case for the kernel as well, or if the problem can be found in the code above.

alrikai · Accepted Answer

EDIT: the extent takes the number of elements if using a CUDA array, but effectively takes the number of bytes if not using a CUDA array (e.g. memory allocated with some non-array variant of cudaMalloc)

From the Runtime API CUDA documentation:

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char

Also, cudaMalloc3D returns a pitched pointer, meaning that it'll have at least the dimensions of your supplied extent, but possibly more for alignment reasons. You have to take this pitch into account when accessing and copying to and from your device memory. See here for the documentation on the cudaPitchedPtr struct

As for using cudaMalloc3D with cudaMemcpy, you might want to take a look at using cudaMemcpy3D (documentation here), it might make your life a bit easier in taking the pitch of your host and device memory into account. To use cudaMemcpy3D you have to create a cudaMemcpy3DParms struct with the appropriate information. It's members are:

cudaArray_t dstArray
struct cudaPos dstPos
struct cudaPitchedPtr dstPtr
struct cudaExtent extent
enumcudaMemcpyKind kind
cudaArray_t srcArray
struct cudaPos srcPos
struct cudaPitchedPtr srcPtr

and you must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Also the docs recommend to initialize the struct to 0 before using it, e.g. cudaMemcpy3DParms myParms = {0};

Also, you might be interested in taking a look at this other SO question

Proper use of cudaMalloc3D with cudaMemcpy

Answers (1)

Related Questions