Reputation: 119
Given that the code is correct, is there some potential performance benefit in calling __syncthreads
as late as possible, as early as possible, or does it not matter? Here's an example with comments that demonstrate the question:
__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }
    // Option #1: Place the call to `__syncthreads()` here?
    // Here is a lot of code that doesn't use `shared_data`.
    // Option #2: Place the call to `__syncthreads()` here?
    // Here is some code that uses `shared_data`.
}
Upvotes: 2
Views: 196
Reputation: 132260
What you are facing is a split between the point where the writes to shared memory are made and the point where they must become visible to the entire block.
NVIDIA has recently introduced a mechanism for just that: arrive + wait.
You start with initializing a barrier:
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
Then, at your "option 1" position, you arrive at the barrier you initialized, receiving a token:
__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
Then comes your unrelated code; finally, at your "option 2" position, you wait for everyone to arrive:
bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
... but note that this call doesn't block, i.e. you'll have to actively "wait" by polling it until it returns true.
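Put together for the kernel in the question, the flow might look like this. This is only a sketch: the `__mbarrier_*` intrinsics need Compute Capability 8.0, and the bare spin loop at the end is a simplification.

```cpp
__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    __shared__ __mbarrier_t bar;

    if (threadIdx.x == 0) {
        // Every thread of the block is expected to arrive at the barrier.
        __mbarrier_init(&bar, blockDim.x);
    }
    __syncthreads();  // make the initialized barrier visible to all threads

    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }

    // "Option 1" position: announce this thread's writes are done; doesn't block.
    __mbarrier_token_t token = __mbarrier_arrive(&bar);

    // ... a lot of code that doesn't use `shared_data` ...

    // "Option 2" position: spin until all threads of the block have arrived.
    while (!__mbarrier_test_wait(&bar, token)) { /* spin */ }

    // ... code that uses `shared_data` ...
}
```

The payoff is that the unrelated code between arrive and wait overlaps with other threads still finishing their writes, instead of everyone stalling at a single `__syncthreads()`.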
Alternatively, you can use NVIDIA's C++ wrappers for this mechanism, presented here.
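With the libcu++ wrapper (`cuda::barrier` from `<cuda/barrier>`), the same split looks roughly like the sketch below; `wait()` blocks for you, falling back to polling on hardware without native support.

```cpp
#include <cuda/barrier>

__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);  // friend function provided by cuda::barrier
    }
    __syncthreads();  // make the initialized barrier visible to all threads

    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }

    auto token = bar.arrive();   // "Option 1" position: non-blocking arrival

    // ... a lot of code that doesn't use `shared_data` ...

    bar.wait(std::move(token));  // "Option 2" position: block until all arrived

    // ... code that uses `shared_data` ...
}
```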
Note that this functionality is relatively new, with Compute Capability at least 7.0 required, and 8.0 or later recommended.
Upvotes: 3