Reputation: 4575
I have a kernel where each warp accumulates the sum of a chunk of data.
At the end of the calculation, the last lane of each warp has to pass its value to the first lane of the next warp.
Currently, this is done via shared memory:
__shared__ int previous_warp_last_lane_val[WARPS_IN_X];
auto lane_id = threadIdx.x & 0x1F; // threadIdx.x % 32
auto warp_id = threadIdx.x >> 5;   // threadIdx.x / 32
if (lane_id == (warpSize - 1))
{
    // store the last lane value for the next warp
    previous_warp_last_lane_val[warp_id] = data;
}
__syncthreads(); // shared memory, maybe there is a better alternative?
// first lane retrieves the data from the last lane of the previous warp
int prev = 0;
if (lane_id == 0 && warp_id > 0)
{
    prev = previous_warp_last_lane_val[warp_id - 1];
}
This works, but it feels rather inefficient: I have to synchronize all warps in the block when in theory only two of them need to synchronize.
So is there a more efficient way to achieve the same functionality? Is shared memory the only alternative?
Upvotes: 1
Views: 854
Reputation: 1340
Your code actually works just fine if all of the block's warps can be resident on the SM at the same time. Various parameters, such as registers per thread, play a role in this, and you can use a profiler to see how the warps are actually scheduled. You can also try reducing threads per block and blocks per grid so that all blocks stay resident. Lastly, the occupancy calculator API helps with this:
#define WARPS_IN_X 32 // covers the maximum of 1024 threads per block

__global__ void MyKernel(int *array, int arrayCount) {
    __shared__ int previous_warp_last_lane_val[WARPS_IN_X];
    auto lane_id = threadIdx.x & 0x1F; // threadIdx.x % 32
    auto warp_id = threadIdx.x >> 5;   // threadIdx.x / 32
    auto idx = blockIdx.x * blockDim.x + threadIdx.x;
    int data = (idx < arrayCount) ? array[idx] : 0; // stand-in for the per-lane sum
    if (lane_id == (warpSize - 1)) {
        // store the last lane value for the next warp
        previous_warp_last_lane_val[warp_id] = data;
    }
    __syncthreads();
    // first lane retrieves the data from the last lane of the previous warp
    int prev = 0;
    if (lane_id == 0 && warp_id > 0) {
        prev = previous_warp_last_lane_val[warp_id - 1];
    }
    // ... prev now feeds into the rest of the computation ...
}
int launchMyKernel(int *array, int arrayCount) {
    int blockSize;   // The launch configurator returned block size
    int minGridSize; // The minimum grid size needed to achieve the
                     // maximum occupancy for a full device launch
    int gridSize;    // The actual grid size needed, based on input size
    int sharedMemory = 0; // Dynamic shared memory per block; zero here,
                          // since the kernel's shared array is static

    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &blockSize,
        (void*)MyKernel,
        sharedMemory,
        arrayCount);

    // Round up according to array size
    gridSize = (arrayCount + blockSize - 1) / blockSize;

    MyKernel<<<gridSize, blockSize, sharedMemory>>>(array, arrayCount);
    cudaDeviceSynchronize();

    // If interested, the occupancy can be calculated with
    // cudaOccupancyMaxActiveBlocksPerMultiprocessor

    return 0;
}
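For reference, here is a minimal sketch of that occupancy calculation, modeled on the example in the CUDA programming guide. It reuses MyKernel and the blockSize chosen above; the reportOccupancy wrapper and the maxActiveBlocks/occupancy variable names are just for illustration:
#include <cstdio>

void reportOccupancy(int blockSize) {
    int maxActiveBlocks; // active blocks per SM for this kernel and block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks, MyKernel, blockSize, 0 /* dynamic shared memory */);

    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // occupancy = active warps per SM / maximum warps per SM
    float occupancy = (maxActiveBlocks * blockSize / (float)prop.warpSize) /
                      (prop.maxThreadsPerMultiProcessor / (float)prop.warpSize);
    printf("Theoretical occupancy: %f\n", occupancy);
}
If this comes out high enough that all warps of a block are resident at once, the single __syncthreads() should be inexpensive in practice.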
Upvotes: 1