Reputation: 27125
I'd like to send a 3D array src
of size size
in each dimension, flattened into a 1D array of size length = size * size * size
, into a kernel, compute a result and store it in dst
. However, at the end, dst
improperly contains all 0s. Here is my code:
int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);
// Allocate source and destination arrays on the host and initialize source array
float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
src[i] = i;
}
// Allocate source and destination arrays on the device
struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);
// Copy to the device, execute kernel, and copy back to the host
cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<<numBlocks, blockSize>>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);
I've left out my error checking of cudaMallocHost()
, cudaMalloc()
and cudaMemcpy()
for clarity. No error is triggered by this code in any case.
What is the correct use of cudaMalloc3D()
with cudaMemcpy()
?
Please let me know if I should post a minimal test case for the kernel as well, or if the problem can be found in the code above.
Upvotes: 2
Views: 3237
Reputation: 4204
EDIT: the extent takes the number of elements if using a CUDA array, but effectively takes the number of bytes if not using a CUDA array (e.g. memory allocated with some non-array variant of cudaMalloc
)
From the Runtime API CUDA documentation:
The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char
Also, cudaMalloc3D
returns a pitched pointer, meaning that it'll have at least the dimensions of your supplied extent, but possibly more for alignment reasons. You have to take this pitch into account when accessing and copying to and from your device memory. See here for the documentation on the cudaPitchedPtr
struct
As for using cudaMalloc3D
with cudaMemcpy
, you might want to take a look at using cudaMemcpy3D
(documentation here), it might make your life a bit easier in taking the pitch of your host and device memory into account. To use cudaMemcpy3D
you have to create a cudaMemcpy3DParms
struct with the appropriate information. It's members are:
cudaArray_t dstArray
struct cudaPos dstPos
struct cudaPitchedPtr dstPtr
struct cudaExtent extent
enumcudaMemcpyKind kind
cudaArray_t srcArray
struct cudaPos srcPos
struct cudaPitchedPtr srcPtr
and you must specify one of srcArray
or srcPtr
and one of dstArray
or dstPtr
. Also the docs recommend to initialize the struct to 0 before using it, e.g.
cudaMemcpy3DParms myParms = {0};
Also, you might be interested in taking a look at this other SO question
Upvotes: 3