Reputation: 69
The following has been troubling me.
Running the same kernel on two different devices, one with compute capability 1.3 and the other with compute capability 2.0, I get better performance with more threads per block (higher occupancy) on the 1.3 device, but the opposite on the 2.0 device. The performance peak on the 2.0 device is at 16 threads per block, an occupancy of 17%; anything smaller or larger than that gives worse performance.
Since the cause most likely lies in the nature of the kernel itself, here it is:
__global__ void
kernel_CalculateRFCH (int xstart, int ystart, int xsize, int ysize,
                      int imxsize, int imysize, int *test, int *dev_binIm,
                      int *per_block_results)
{
  int x2, y2, bin, bin2;
  // Dynamically sized shared array; a declaration like
  // "__shared__ int s_pixels[blockDim.x*blockDim.y];" would not compile,
  // so the size (blockDim.x * blockDim.y * sizeof(int)) is passed at launch.
  extern __shared__ int s_pixels[];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int tidy = threadIdx.y + blockIdx.y * blockDim.y;
  int tidx = threadIdx.x + blockIdx.x * blockDim.x;

  // Clip the region of interest to the image borders.
  if (xstart + xsize > imxsize)
    xsize = imxsize - xstart;
  if (ystart + ysize > imysize)
    ysize = imysize - ystart;

  // Each thread zeroes its own counter in shared memory.
  s_pixels[tx * blockDim.y + ty] = 0;

  // Only threads inside the region of interest do any work.
  if (tidy >= ystart && tidy < ysize + ystart && tidx >= xstart && tidx < xsize + xstart)
    {
      bin = dev_binIm[tidx + tidy * imxsize];
      if (bin >= 0)
        {
          // Scan a window of up to 10 pixels to the right and below (tidx, tidy).
          x2 = tidx;
          y2 = tidy;
          while (y2 < ystart + ysize)
            {
              if (x2 >= xstart + xsize || x2 - tidx > 10)
                {
                  x2 = xstart;
                  y2++;
                  if (tidx - x2 > 10)
                    x2 = tidx - 10;
                  if (y2 - tidy > 10)
                    {
                      y2 = ystart + ysize;
                      break;
                    }
                  if (y2 >= ystart + ysize)
                    break;
                }
              bin2 = dev_binIm[x2 + y2 * imxsize];
              if (bin2 >= 0)
                {
                  test[(tidx + tidy * imxsize) * 221 + s_pixels[tx * blockDim.y + ty]] = bin + bin2 * 80;
                  s_pixels[tx * blockDim.y + ty]++;
                }
              x2++;
            }
        }
    }

  // Make every thread's count visible before the block-wide reduction.
  __syncthreads ();

  // Tree reduction of the per-thread counts in shared memory.
  for (int offset = (blockDim.x * blockDim.y) / 2; offset > 0; offset >>= 1)
    {
      if ((tx * blockDim.y + ty) < offset)
        {
          s_pixels[tx * blockDim.y + ty] += s_pixels[tx * blockDim.y + ty + offset];
        }
      __syncthreads ();
    }

  // Thread 0 of each block writes out the block's total.
  if (tx * blockDim.y + ty == 0)
    {
      per_block_results[blockIdx.x * gridDim.y + blockIdx.y] = s_pixels[0];
    }
}
I use 2-D threading.
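For reference, the launch looks roughly like this (host-side names such as dev_test and dev_per_block_results are placeholders; the block dimensions are what I vary between runs, with 4x4 = 16 threads shown here):

dim3 block (4, 4);                               /* 16 threads per block, varied between runs */
dim3 grid ((imxsize + block.x - 1) / block.x,    /* one thread per image pixel */
           (imysize + block.y - 1) / block.y);
size_t smem = block.x * block.y * sizeof (int);  /* dynamic shared memory for s_pixels */
kernel_CalculateRFCH <<< grid, block, smem >>> (xstart, ystart, xsize, ysize,
                                                imxsize, imysize, dev_test,
                                                dev_binIm, dev_per_block_results);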
ptxas info : Compiling entry function '_Z20kernel_CalculateRFCHiiiiiiPiS_' for 'sm_10'
ptxas info : Used 16 registers, 128 bytes smem, 8 bytes cmem[1]
16 registers are reported in every case, on every device.
Any ideas as to why this could be happening would be very enlightening.
Upvotes: 1
Views: 845
Reputation: 7255
Apart from the general remarks made above, your kernel is a very special case, as most of the threads don't do any work at all. Why don't you add xstart and ystart to tidx and tidy straight away and choose a smaller grid, as in the sketch below? Your better performance at smaller block sizes might just be an artefact of how the region of interest is split into blocks.
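As a rough sketch (keeping your variable names; only the index computation and the grid size change, the rest of the kernel stays as it is):

/* Inside the kernel: offset the global indices by the ROI origin,
   so that block (0,0) maps to (xstart, ystart) instead of (0,0). */
int tidx = xstart + threadIdx.x + blockIdx.x * blockDim.x;
int tidy = ystart + threadIdx.y + blockIdx.y * blockDim.y;

/* On the host: size the grid to the region of interest, not the whole image. */
dim3 block (4, 4);
dim3 grid ((xsize + block.x - 1) / block.x,
           (ysize + block.y - 1) / block.y);

The bounds check at the top of the kernel then only has to guard the right and bottom edges of the region.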
This also explains why you see a big difference between compute capability 1.x devices and CC 2.0+ devices. Beginning with CC 2.0, Nvidia GPUs have become a lot better at handling kernels whose runtime varies widely between blocks.
On compute capability 1.x, a new wave of blocks is only scheduled once all currently running blocks have finished, while from CC 2.0 on, a new block is started as soon as any old block has finished.
Upvotes: 1