Reputation: 13
Hi, I recently got a CUDA kernel to optimize. Here is the original kernel:
__global__ void kernel_base( float *data, int x_dim, int y_dim )
{
    int ix  = blockIdx.x;                             // column index (blockDim.x == 1)
    int iy  = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int idx = iy * x_dim + ix;

    float tmp = data[idx];
    if( ix % 2 )
    {
        tmp += sqrtf( sinf(tmp) + 1.f );
    }
    else
    {
        tmp += sqrtf( cosf(tmp) + 1.f );
    }
    data[idx] = tmp;
}
dim3 block( 1, 512 );
dim3 grid( 2048/1, 2048/512 );
kernel_base<<<grid,block>>>( d_data, 2048, 2048 );
The basic problem here is the dilemma between memory coalescing and thread divergence. The original code processes the array in column-major order, so it has a strided memory access pattern but no divergence. I could change it to row-major order, which would give coalesced access but bring back the problem of thread divergence.
So does anyone have a better idea how to maximize performance?
Upvotes: 1
Views: 634
Reputation: 151799
Thread divergence here isn't a big problem compared to the strided memory access, in terms of performance. I would go for coalescing. Furthermore, your data storage has an implicit AoS ordering. If you can reorder the data to SoA, you can solve both problems.
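For instance (a minimal sketch of the SoA idea, assuming you can split the interleaved array into one buffer of even columns and one of odd columns; the names evens/odds are mine):

// Hypothetical SoA layout: all even columns packed into evens[],
// all odd columns packed into odds[], n elements each.
__global__ void kernel_soa( float *evens, float *odds, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
    {
        float e  = evens[i];
        evens[i] = e + sqrtf( cosf(e) + 1.f );   // even columns take the cos branch
        float o  = odds[i];
        odds[i]  = o + sqrtf( sinf(o) + 1.f );   // odd columns take the sin branch
    }
}

Accesses are contiguous (coalesced) and every thread executes both statements, so there is no divergence.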
So I would reorder this kernel to first handle things in a row-major fashion. This solves the coalescing problem but introduces warp divergence.
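Something like this (a sketch; the 256-wide block is just an example):

__global__ void kernel_row_major( float *data, int x_dim, int y_dim )
{
    // Consecutive threads now touch consecutive elements of a row,
    // so each warp's loads and stores are coalesced.
    int ix  = blockIdx.x * blockDim.x + threadIdx.x;
    int iy  = blockIdx.y;
    int idx = iy * x_dim + ix;

    float tmp = data[idx];
    if( ix % 2 )                          // adjacent lanes diverge here
        tmp += sqrtf( sinf(tmp) + 1.f );
    else
        tmp += sqrtf( cosf(tmp) + 1.f );
    data[idx] = tmp;
}

dim3 block( 256, 1 );
dim3 grid( 2048/256, 2048 );
kernel_row_major<<<grid,block>>>( d_data, 2048, 2048 );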
If you're unable to re-order the data, I would then consider eliminating warp divergence by modifying the indexing scheme, so that even warps handle even elements, and odd warps handle odd elements.
This will eliminate warp divergence but break perfect coalescing again; the caches should help with that. On Fermi, the L1 cache should smooth over this pattern pretty well. I would then benchmark this case against the warp-divergent case to see which is faster.
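Here's a sketch of the kind of index remapping I mean (untested; it assumes a 1D block whose width is a multiple of 64, so warps pair up cleanly):

__global__ void kernel_warp_parity( float *data, int x_dim, int y_dim )
{
    int warp = threadIdx.x >> 5;   // warp index within the block
    int lane = threadIdx.x & 31;   // lane within the warp
    // Warps 2k and 2k+1 jointly cover 64 consecutive columns:
    // the even warp takes the even columns, the odd warp the odd ones.
    int ix  = blockIdx.x * blockDim.x + (warp >> 1) * 64 + 2 * lane + (warp & 1);
    int iy  = blockIdx.y;
    int idx = iy * x_dim + ix;

    float tmp = data[idx];
    if( warp & 1 )                 // uniform across the warp: no divergence
        tmp += sqrtf( sinf(tmp) + 1.f );
    else
        tmp += sqrtf( cosf(tmp) + 1.f );
    data[idx] = tmp;
}

Each warp now reads with a stride of 2 floats, so it touches two cache lines per access instead of one; that is the imperfect-coalescing cost mentioned above.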
Upvotes: 5
Reputation: 21465
Take into account that
sin(x) = cos(x - pi/2)
Accordingly, you can replace the if ... else branch with
tmp += sqrtf( cosf(tmp - (ix % 2) * pi / 2) + 1.f );
avoiding branch divergence.
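A sketch of the full kernel with this trick, combined with the row-major indexing suggested above for coalescing (CUDART_PIO2_F is pi/2 as a float, from CUDA's math_constants.h):

#include <math_constants.h>   // CUDART_PIO2_F = pi/2 as a float

__global__ void kernel_branchless( float *data, int x_dim, int y_dim )
{
    int ix  = blockIdx.x * blockDim.x + threadIdx.x;   // row-major for coalescing
    int iy  = blockIdx.y;
    int idx = iy * x_dim + ix;

    float tmp = data[idx];
    // cosf(tmp - pi/2) == sinf(tmp), so one expression replaces both branches.
    tmp += sqrtf( cosf(tmp - (ix % 2) * CUDART_PIO2_F) + 1.f );
    data[idx] = tmp;
}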
Upvotes: 3
Reputation: 995
If I were doing this, I would make the block size 16 x 16 or some other shape with a lower aspect ratio. I would use shared memory to grab 2 blocks' worth of data (each thread grabs 2 elements from data, probably separated by blockDim.x elements), then have each block process its assigned "odd" columns followed by the "even" columns. You'll have to recompute ix and iy (and probably idx as well), and you'll use half as many blocks, but you get coalesced memory access followed by non-divergent code.
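A rough sketch of what I have in mind (untested; tile sizes and names are illustrative):

#define TILE_W 16
#define TILE_H 16

__global__ void kernel_tiled( float *data, int x_dim, int y_dim )
{
    __shared__ float tile[TILE_H][2 * TILE_W];

    // Each block covers a (2*TILE_W) x TILE_H region, so half as many blocks.
    int tile_x = blockIdx.x * 2 * TILE_W;
    int iy     = blockIdx.y * TILE_H + threadIdx.y;

    // Coalesced load: each thread grabs 2 elements, blockDim.x apart.
    int ix0 = tile_x + threadIdx.x;
    tile[threadIdx.y][threadIdx.x]          = data[iy * x_dim + ix0];
    tile[threadIdx.y][threadIdx.x + TILE_W] = data[iy * x_dim + ix0 + TILE_W];
    __syncthreads();

    // Non-divergent compute: every thread first handles an even column...
    int even = 2 * threadIdx.x;              // globally even, since tile_x is even
    float t  = tile[threadIdx.y][even];
    tile[threadIdx.y][even] = t + sqrtf( cosf(t) + 1.f );

    // ...then the odd column next to it. All lanes take the same path in each phase.
    float u  = tile[threadIdx.y][even + 1];
    tile[threadIdx.y][even + 1] = u + sqrtf( sinf(u) + 1.f );
    __syncthreads();

    // Coalesced store back.
    data[iy * x_dim + ix0]          = tile[threadIdx.y][threadIdx.x];
    data[iy * x_dim + ix0 + TILE_W] = tile[threadIdx.y][threadIdx.x + TILE_W];
}

dim3 block( TILE_W, TILE_H );                 // 16 x 16
dim3 grid( 2048/(2*TILE_W), 2048/TILE_H );    // half as many blocks in x
kernel_tiled<<<grid,block>>>( d_data, 2048, 2048 );

Note the stride-2 shared-memory access in the compute phase causes 2-way bank conflicts; padding the tile rows by one float (2 * TILE_W + 1) would remove them.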
Upvotes: 0