Kris

Reputation: 10307

Optimizing a Very Simple Image Processing Kernel

I was hoping someone could give me a hand here. I've been getting my feet wet in CUDA and wrote a simple kernel to negate an image. It works brilliantly and I'm pretty happy with it.

I guess my rather stupid question is... is there any way I could optimize this kernel? I tried to use shared memory, but the image has 19,224,000 pixels.

I tried just declaring __shared__ int sharedMem[19224000], which simply didn't run. I'm a little lost here, as a CUDA programmer could probably tell.

Here is my kernel:

__global__ void cuda_negate_image(int * new_array, int * old_array, int rows, int cols){

    // One thread per pixel, indexed linearly across the image
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int n = rows * cols;

    if (i < n)
        new_array[i] = 255 - old_array[i];

}

Any help would be awesome!

Upvotes: 2

Views: 234

Answers (1)

talonmies

Reputation: 72351

There isn't much scope for optimisation here. For simple, memory-bound operations the four golden rules are usually:

  1. Coalesce memory reads and writes
  2. Maximise the bytes per memory transaction when using coalesced memory access
  3. Use the appropriate compiler hints to ensure that the emitted code is optimal
  4. Amortise thread scheduling and setup overhead by having each thread process multiple inputs, where practical. (Note this requires a different approach to execution grid parameter selection, i.e. size for the utilisation of your device, rather than for the total amount of available work)

Applying those principles to your kernel, I get something like this:

// Scalar negation: out = 255 - in
__device__ __forceinline__ void negate(const int &in, int &out)
{
   out = 255 - in;
}

// Vectorised overloads so each thread can handle 2 or 4 pixels per
// memory transaction (rule 2)
__device__ __forceinline__ void negate(const int2 &in, int2 &out)
{
   negate(in.x, out.x);
   negate(in.y, out.y);
}

__device__ __forceinline__ void negate(const int4 &in, int4 &out)
{
   negate(in.x, out.x);
   negate(in.y, out.y);
   negate(in.z, out.z);
   negate(in.w, out.w);
}

// Grid-stride loop: each thread processes multiple elements, so the
// grid can be sized for device utilisation rather than problem size (rule 4)
template<typename T>
__global__ void cuda_negate_image(T * __restrict__ new_array, const T * __restrict__ old_array, int n)
{

   int i = blockDim.x * blockIdx.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   T oldval, newval;
   for(; i < n; i += stride) {
      oldval = old_array[i];
      negate(oldval, newval);
      new_array[i] = newval;
   }
}

// Explicit instantiations for 1, 2 and 4 pixels per thread per iteration
template __global__ void cuda_negate_image<int>(int * __restrict__ new_array, const int * __restrict__ old_array, int n);
template __global__ void cuda_negate_image<int2>(int2 * __restrict__ new_array, const int2 * __restrict__ old_array, int n);
template __global__ void cuda_negate_image<int4>(int4 * __restrict__ new_array, const int4 * __restrict__ old_array, int n);
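
For illustration, a host-side launch might look something like the sketch below. The block size and the blocks-per-SM factor are assumptions you would tune for your own device, not fixed values; the helper name negate_image is likewise hypothetical:

// Hypothetical launch code, assuming the templated kernel above is in scope.
// The grid is sized from the device's multiprocessor count (rule 4), not
// from the pixel count; the grid-stride loop covers the remaining work.
// cudaMalloc'd pointers are sufficiently aligned for the int4 reinterpret_cast.
#include <cuda_runtime.h>

void negate_image(int *d_new, int *d_old, int n)  // n = pixel count, assumed divisible by 4
{
    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    const int threads = 256;                            // assumed block size
    const int blocks  = 4 * prop.multiProcessorCount;   // assumed occupancy factor

    // int4 version: each thread consumes 4 pixels per load/store,
    // so the kernel sees n / 4 elements
    cuda_negate_image<int4><<<blocks, threads>>>(
        reinterpret_cast<int4 *>(d_new),
        reinterpret_cast<const int4 *>(d_old),
        n / 4);
}

Whether 256 threads and 4 blocks per SM are the right numbers depends entirely on the GPU in question; treat them as a starting point for the benchmarking mentioned below.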

Only benchmarking on your target hardware will tell you which version of the code is the fastest and whether this is even worth bothering with.

Upvotes: 4
