Reputation: 846
I have the following code that I'm trying to implement in cuda but I'm having a problem of flattening a 3D array to 1D in cuda
C++ code
for(int i=0; i<w; i++)
for(int j=0; j<h; j++)
for(int k=0; k<d; k++)
arr[h*w*i+ w*j+ k] = (h*w*i+ w*j+ k)*2;
This is what I have so far in Cuda
int w = h = d;
int N = 64;
__global__ void getIndex(float* A)
{
int i = blockIdx.x;
int j = blockIdx.y;
int k = blockIdx.z;
A[h*w*i+ w*j+ k] = h*w*i+ w*j+ k;
}
int main(int argc, char **argv)
{
float *d_A;
cudaMalloc((void **)&d_A, w * h * d * sizeof(float) );
getIndex <<<N,1>>> (d_A);
}
But I'm not getting the result I'm expecting, I do not know how to get the right i,j
and k
indices
Upvotes: 0
Views: 2182
Reputation: 861
Consider a 3D problem of size w
x h
x d
. (This could be a simple array which has to be set like in your question or any other 3D problem that is easy to parallelize.) I will use your simple set-task for demonstration purpose.
The easiest way to handle this with a CUDA kernel is to launch one thread per array entry, that is w*h*d
threads. This answer discusses why one thread per element may not always be the best solution.
Now let us have a look at the following lines of code
dim3 numThreads(w,h,d);
getIndex <<<1, numThreads>>> (d_A, w, h, d);
Here we are launching a kernel with a total of w*h*d
threads.
The kernel can than be implemented as
__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
int i = threadIdx.x;
int j = threadIdx.y;
int k = threadIdx.z;
A[h*d*i+ d*j+ k] = h*d*i+ d*j+ k;
}
But there is a problem with this kernel and the kernel call: The number of threads per thread block is limited (also the number of "threads in a specific direction" is bounded = the z direction is generally most bounded). As we are only calling one thread block our problem size cannot be exceed these certain limits (e.g. w*h*d <= 1024
).
This is what threadblocks are for. You can practically launch a kernel with as many threads as you want. (This is not true but the limits for the maximal amount of threadblocks are not likely to be exhausted.)
Calling the kernel this way:
dim3 numBlocks(w/8,h/8,d/8);
dim3 numThreads(8,8,8);
getIndex <<<numBlocks, numThreads>>> (d_A, w, h, d);
will launch the kernel for w/8 * h/8 * d/8
thread blocks while every block contains 8*8*8
threads. So in total w*h*d
threads will be called.
Now we have to adjust our kernel accordingly:
__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
int bx = blockIdx.x;
int by = blockIdx.y;
int bz = blockIdx.z;
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
A[h*d*(8*bx + tx)+ d*(8*by + ty)+ (8*bz + tz)] = h*d*(8*bx + tx)+ d*(8*by + ty)+ (8*bz + tz);
}
Note:
blockDim.x
instead of the fixed size 8
and gridDim.x
to calculate w
via gridDim.x*blockDim.x
. The other two dimensions are handled likewise.w
, h
and d
have to be multiples of 8. You can also generalize the kernel to allow every dimensions. (Then you have to parse all three dimensions to the kernel to check if the calculated position is still in range of the problem.)Upvotes: 2