Flattening a 3D array to 1D in cuda

Question

I have the following code that I'm trying to implement in cuda but I'm having a problem of flattening a 3D array to 1D in cuda

C++ code

for(int i=0; i



This is what I have so far in Cuda

  int w = h = d;
  int N = 64;

 __global__ void getIndex(float* A)
{
  int i = blockIdx.x;
  int j = blockIdx.y;
  int k = blockIdx.z;
  A[h*w*i+ w*j+ k] = h*w*i+ w*j+ k;
}


int main(int argc, char **argv)
 {

    float *d_A;
    cudaMalloc((void **)&d_A, w * h * d * sizeof(float) );
    getIndex <<>> (d_A);
  }


But I'm not getting the result I'm expecting, I do not know how to get the right i,j and k indices

BlameTheBits · Accepted Answer

Consider a 3D problem of size w x h x d. (This could be a simple array which has to be set like in your question or any other 3D problem that is easy to parallelize.) I will use your simple set-task for demonstration purpose.

The easiest way to handle this with a CUDA kernel is to launch one thread per array entry, that is w*h*d threads. This answer discusses why one thread per element may not always be the best solution.

Now let us have a look at the following lines of code

dim3 numThreads(w,h,d);
getIndex <<<1, numThreads>>> (d_A, w, h, d);

Here we are launching a kernel with a total of w*h*d threads. The kernel can than be implemented as

__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    int k = threadIdx.z;
    A[h*d*i+ d*j+ k] = h*d*i+ d*j+ k;
}

But there is a problem with this kernel and the kernel call: The number of threads per thread block is limited (also the number of "threads in a specific direction" is bounded = the z direction is generally most bounded). As we are only calling one thread block our problem size cannot be exceed these certain limits (e.g. w*h*d <= 1024).

This is what threadblocks are for. You can practically launch a kernel with as many threads as you want. (This is not true but the limits for the maximal amount of threadblocks are not likely to be exhausted.)

Calling the kernel this way:

dim3 numBlocks(w/8,h/8,d/8);
dim3 numThreads(8,8,8);
getIndex <<>> (d_A, w, h, d);

will launch the kernel for w/8 * h/8 * d/8 thread blocks while every block contains 8*8*8 threads. So in total w*h*d threads will be called. Now we have to adjust our kernel accordingly:

__global__ void getIndex(float* A, int w, int h, int d) // we actually do not need w
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int bz = blockIdx.z;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;
    A[h*d*(8*bx + tx)+ d*(8*by + ty)+ (8*bz + tz)] = h*d*(8*bx + tx)+ d*(8*by + ty)+ (8*bz + tz);
}

Note:

You can write a more general kernel using blockDim.x instead of the fixed size 8 and gridDim.x to calculate w via gridDim.x*blockDim.x. The other two dimensions are handled likewise.
In the proposed example all three dimensions w, h and d have to be multiples of 8. You can also generalize the kernel to allow every dimensions. (Then you have to parse all three dimensions to the kernel to check if the calculated position is still in range of the problem.)
As already mentioned, it may be more efficient to edit more than one entry of the array per thread. This again have to be considered when calling the kernel. A wrapper function which takes the problem size and the data and calls the kernel with the right block and thread configuration may be useful.

Flattening a 3D array to 1D in cuda

Answers (1)

Related Questions