Elad Maimoni

Reputation: 4575

Can a CUDA warp communicate with a different warp without using shared memory?

I have a kernel where each warp accumulates the sum of a chunk of data.

At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the next warp.

Currently, this is done via shared memory:

__shared__ int previous_warp_last_lane_val[WARPS_IN_X];

auto lane_id = threadIdx.x & 0x1F;
auto warp_id = threadIdx.x >> 5; 

if (lane_id == (warpSize - 1))
{
    // store the last lane value for the next warp
    previous_warp_last_lane_val[warp_id] = data;
}

__syncthreads(); // shared memory, maybe there is a better alternative?

// first lane retrieve the data from the last lane of previous warp 
int prev = 0;
if (lane_id == 0 && warp_id > 0)
{
    prev = previous_warp_last_lane_val[warp_id - 1];
}

This works, but it feels rather inefficient: I have to synchronize all the warps in the block when, in theory, only two of them need to synchronize.

So is there a more efficient way to achieve the same functionality? Is shared memory the only option?

Upvotes: 1

Views: 854

Answers (1)

AmirSojoodi

Reputation: 1340

Your code actually works just fine, provided all of the warps can be scheduled on the SM at the same time. Various parameters, such as registers per thread, play a role in this, and you can use the profilers to see how the warps are scheduled. You can try reducing the threads per block and the blocks per grid so that all blocks become resident. Lastly, the occupancy calculator API helps with this:

#define WARPS_IN_X 32   // maximum warps per block (1024 threads / warp size 32)

__global__ void MyKernel(int *array, int arrayCount) {
    __shared__ int previous_warp_last_lane_val[WARPS_IN_X];

    auto lane_id = threadIdx.x & 0x1F;
    auto warp_id = threadIdx.x >> 5;

    // Per-thread value; in the real kernel this is the warp's accumulated sum.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int data = (idx < arrayCount) ? array[idx] : 0;

    if (lane_id == (warpSize - 1)) {
        // store the last lane value for the next warp
        previous_warp_last_lane_val[warp_id] = data;
    }

    __syncthreads();

    // first lane retrieves the data from the last lane of the previous warp
    int prev = 0;
    if (lane_id == 0 && warp_id > 0) {
        prev = previous_warp_last_lane_val[warp_id - 1];
    }
    (void)prev; // use prev in the rest of the kernel
}

int launchMyKernel(int *array, int arrayCount) {
    int blockSize;          // The launch configurator's suggested block size
    int minGridSize;        // The minimum grid size needed to achieve the
                            // maximum occupancy for a full device launch
    int gridSize;           // The actual grid size needed, based on input size
    int sharedMemory = 0;   // Dynamic shared memory per block; the array above
                            // is statically allocated, so 0 here

    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &blockSize,
        (void*)MyKernel,
        sharedMemory,
        arrayCount);

    // Round up according to array size
    gridSize = (arrayCount + blockSize - 1) / blockSize;

    MyKernel<<<gridSize, blockSize, sharedMemory>>>(array, arrayCount);
    cudaDeviceSynchronize();

    // If interested, the occupancy can be calculated with
    // cudaOccupancyMaxActiveBlocksPerMultiprocessor (see below)

    return 0;
}

Upvotes: 1
