Using CUDA to find the pixel-wise average value of a bunch of images

Question

So I have a cube of images, 512x512x512, and I want to sum the images pixel-wise and save the result to a final image. So if all the pixels had value 1, the final image would be all 512. I am having trouble understanding the indexing to do this in CUDA. I figure one thread's job will be to sum all 512 values at its pixel, so the total number of threads will be 512x512. I plan to do it with 512 blocks of 512 threads each. From here, I am having trouble coming up with the indexing to sum along the depth. Any help will be greatly appreciated.

pQB · Accepted Answer

One way to solve this problem is to imagine the cube as a stack of Z slices. The X and Y coordinates refer to the width and height of the image, and the Z coordinate selects a slice along the depth. Each thread handles one (X, Y) pixel and iterates over the Z coordinate to accumulate the values.
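
The indexing described above can be sketched on the CPU first, which also gives a reference to compare GPU output against. This is a minimal sketch in plain C++, assuming the slices are stored one after another in row-major order; the function name `sum_slices_cpu` is hypothetical and not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: sums a width x height x depth cube
// along Z. Assumes slices are stored back to back, each in row-major order.
std::vector<unsigned int> sum_slices_cpu(const std::vector<unsigned char>& cube,
                                         std::size_t width, std::size_t height,
                                         std::size_t depth)
{
    assert(cube.size() == width * height * depth);
    const std::size_t slice = width * height;  // number of pixels in one slice
    std::vector<unsigned int> out(slice, 0);

    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t idx = x + y * width;  // pixel index within a slice
            unsigned int r = 0;
            for (std::size_t z = 0; z < depth; ++z)
                r += cube[z * slice + idx];         // same pixel in slice z
            out[idx] = r;
        }
    }
    return out;
}
```

In a GPU version, the two outer loops over x and y are what the thread grid replaces; only the loop over z stays inside each thread.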

With this in mind, configure the kernel launch with blocks of 16x16 threads and a grid of enough blocks to cover the width and height of the image (I'm assuming a grayscale image with 1 byte per pixel):

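#define THREADS 16
// kernel configuration: 16x16 threads per block
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
// one thread per pixel; assumes WIDTH and HEIGHT are multiples of THREADS
dim3 dimGrid  = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_data, WIDTH, HEIGHT, DEPTH);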
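
The grid size above divides evenly for a 512x512 image. For sizes that are not a multiple of the block size, a common approach is to round the block count up, roughly as sketched here. The helper name `blocks_for` is hypothetical, and with a rounded-up grid the kernel has to skip any thread whose x or y falls outside the image.

```cpp
// Hypothetical helper: number of blocks needed to cover `size` pixels
// with `threads` threads per block, rounding up so no pixel is left out.
constexpr int blocks_for(int size, int threads)
{
    return (size + threads - 1) / threads;
}
```

It would be used as dim3 dimGrid( blocks_for(WIDTH, THREADS), blocks_for(HEIGHT, THREADS) ). For WIDTH = 512 and THREADS = 16 this gives the same 32 blocks as the plain division.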

If you are clear on how to index a 2D array, looping through the Z dimension should be clear as well:

__global__ void kernel(const unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
  // map from threadIdx/blockIdx to pixel position
  int x = threadIdx.x + blockIdx.x * blockDim.x;
  int y = threadIdx.y + blockIdx.y * blockDim.y;
  // skip threads that fall outside the image
  if (x >= WIDTH || y >= HEIGHT) return;
  // calculate the global index of the pixel into the image array
  // this global index points into the first slice of the cube
  int idx = x + y * WIDTH;

  // partial result; 32 bits are needed because the sum can reach DEPTH * 255
  unsigned int r = 0;

  // iterate in the Z dimension
  for (int z = 0; z < DEPTH; ++z)
  {
    // WIDTH * HEIGHT is the offset of one slice
    int idx_z = z * WIDTH*HEIGHT + idx;
    r += i_data[ idx_z ];
  }
  // o_data is a 2D array, so you can use the global index idx
  // (its elements are wider than 1 byte so the sum does not overflow;
  //  store r / DEPTH instead if you want the average)
  o_data[ idx ] = r;
}

This is a naive implementation. In order to maximize memory throughput, the data should be properly aligned.
