Reputation: 69
The following has been troubling me.
Running the same kernel on two different devices, one with compute capability 1.3 and the other with compute capability 2.0, I get better performance with more threads per block (higher occupancy) on the 1.3 device, but the opposite on the 2.0 device. The performance peak on the 2.0 device is at 16 threads per block, an occupancy of 17%; anything smaller or larger than that gives worse performance.
Since the cause most likely lies in the nature of the kernel itself, here it is:
__global__ void
kernel_CalculateRFCH (int xstart, int ystart, int xsize, int ysize,
                      int imxsize, int imysize, int *test, int *dev_binIm,
                      int *per_block_results)
{
  int x2, y2, bin, bin2;
  // Dynamically sized shared array; a declaration like
  // "__shared__ int s_pixels[blockDim.x*blockDim.y];" would not compile,
  // so the size (blockDim.x * blockDim.y * sizeof(int)) is passed at launch.
  extern __shared__ int s_pixels[];
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int tidy = threadIdx.y + blockIdx.y * blockDim.y;
  int tidx = threadIdx.x + blockIdx.x * blockDim.x;

  // Clip the region of interest to the image borders.
  if (xstart + xsize > imxsize)
    xsize = imxsize - xstart;
  if (ystart + ysize > imysize)
    ysize = imysize - ystart;

  // Each thread zeroes its own counter in shared memory.
  s_pixels[tx * blockDim.y + ty] = 0;

  // Only threads inside the region of interest do any work.
  if (tidy >= ystart && tidy < ysize + ystart && tidx >= xstart && tidx < xsize + xstart)
    {
      bin = dev_binIm[tidx + tidy * imxsize];
      if (bin >= 0)
        {
          // Scan a window of up to 10 pixels to the right and below (tidx, tidy).
          x2 = tidx;
          y2 = tidy;
          while (y2 < ystart + ysize)
            {
              if (x2 >= xstart + xsize || x2 - tidx > 10)
                {
                  x2 = xstart;
                  y2++;
                  if (tidx - x2 > 10)
                    x2 = tidx - 10;
                  if (y2 - tidy > 10)
                    {
                      y2 = ystart + ysize;
                      break;
                    }
                  if (y2 >= ystart + ysize)
                    break;
                }
              bin2 = dev_binIm[x2 + y2 * imxsize];
              if (bin2 >= 0)
                {
                  test[(tidx + tidy * imxsize) * 221 + s_pixels[tx * blockDim.y + ty]] = bin + bin2 * 80;
                  s_pixels[tx * blockDim.y + ty]++;
                }
              x2++;
            }
        }
    }

  // Make every thread's count visible before the block-wide reduction.
  __syncthreads ();

  // Tree reduction of the per-thread counts in shared memory.
  for (int offset = (blockDim.x * blockDim.y) / 2; offset > 0; offset >>= 1)
    {
      if ((tx * blockDim.y + ty) < offset)
        {
          s_pixels[tx * blockDim.y + ty] += s_pixels[tx * blockDim.y + ty + offset];
        }
      __syncthreads ();
    }

  // Thread 0 of each block writes out the block's total.
  if (tx * blockDim.y + ty == 0)
    {
      per_block_results[blockIdx.x * gridDim.y + blockIdx.y] = s_pixels[0];
    }
}
I use 2-D threading.
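For reference, the launch looks roughly like this (host-side names such as dev_test and dev_per_block_results are placeholders; the block dimensions are what I vary between runs, with 4x4 = 16 threads shown here):

dim3 block (4, 4);                               /* 16 threads per block, varied between runs */
dim3 grid ((imxsize + block.x - 1) / block.x,    /* one thread per image pixel */
           (imysize + block.y - 1) / block.y);
size_t smem = block.x * block.y * sizeof (int);  /* dynamic shared memory for s_pixels */
kernel_CalculateRFCH <<< grid, block, smem >>> (xstart, ystart, xsize, ysize,
                                                imxsize, imysize, dev_test,
                                                dev_binIm, dev_per_block_results);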
ptxas info : Compiling entry function '_Z20kernel_CalculateRFCHiiiiiiPiS_' for 'sm_10'
ptxas info : Used 16 registers, 128 bytes smem, 8 bytes cmem[1]
16 registers are reported in every case, on every device.
Any ideas as to why this could be happening would be very enlightening.
Upvotes: 1
Views: 845
Reputation: 7255
Apart from the general remarks made above, your kernel is a very special case, as most of the threads don't do any work at all. Why don't you add xstart and ystart to tidx and tidy straight away and choose a smaller grid, as in the sketch below? Your better performance at smaller block sizes might just be an artefact of how the region of interest is split into blocks.
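As a rough sketch (keeping your variable names; only the index computation and the grid size change, the rest of the kernel stays as it is):

/* Inside the kernel: offset the global indices by the ROI origin,
   so that block (0,0) maps to (xstart, ystart) instead of (0,0). */
int tidx = xstart + threadIdx.x + blockIdx.x * blockDim.x;
int tidy = ystart + threadIdx.y + blockIdx.y * blockDim.y;

/* On the host: size the grid to the region of interest, not the whole image. */
dim3 block (4, 4);
dim3 grid ((xsize + block.x - 1) / block.x,
           (ysize + block.y - 1) / block.y);

The bounds check at the top of the kernel then only has to guard the right and bottom edges of the region.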
This also explains why you see a big difference between compute capability 1.x devices and CC 2.0+ devices. Beginning with CC 2.0, Nvidia GPUs have become a lot better at handling kernels whose runtime varies widely between blocks.
On compute capability 1.x, a new wave of blocks is only scheduled once all currently running blocks have finished, while from CC 2.0 on, a new block is started as soon as any old block has finished.
Upvotes: 1