Reputation: 119
Given that the code is correct, is there some potential performance benefit in calling __syncthreads
as late as possible, as early as possible, or does it not matter? Here's an example with comments that demonstrate the question:
__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }
    // Option #1: Place the call to `__syncthreads()` here?
    // Here is a lot of code that doesn't use `shared_data`.
    // Option #2: Place the call to `__syncthreads()` here?
    // Here is some code that uses `shared_data`.
}
Upvotes: 2
Views: 196
Reputation: 132260
What you are facing is a split between the point where the writes to shared memory are made and the point where they must become visible to the entire block.
NVIDIA has recently introduced a mechanism for just that: arrive + wait.
You start with initializing a barrier:
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
Then, at your "option 1" position, you arrive at the barrier you initialized, receiving a token:
__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
Then comes your unrelated code; finally, at your "option 2" position, you wait for everyone to arrive:
bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
... but note that this call doesn't block, i.e. you'll have to actively "wait" by polling it until it returns true.
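Put together for the kernel in the question, the flow might look like this. This is only a sketch: the `__mbarrier_*` intrinsics need Compute Capability 8.0, and the bare spin loop at the end is a simplification.

```cpp
__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    __shared__ __mbarrier_t bar;

    if (threadIdx.x == 0) {
        // Every thread of the block is expected to arrive at the barrier.
        __mbarrier_init(&bar, blockDim.x);
    }
    __syncthreads();  // make the initialized barrier visible to all threads

    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }

    // "Option 1" position: announce this thread's writes are done; doesn't block.
    __mbarrier_token_t token = __mbarrier_arrive(&bar);

    // ... a lot of code that doesn't use `shared_data` ...

    // "Option 2" position: spin until all threads of the block have arrived.
    while (!__mbarrier_test_wait(&bar, token)) { /* spin */ }

    // ... code that uses `shared_data` ...
}
```

The payoff is that the unrelated code between arrive and wait overlaps with other threads still finishing their writes, instead of everyone stalling at a single `__syncthreads()`.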
Alternatively, you can use NVIDIA's C++ wrappers for this mechanism, presented here.
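With the libcu++ wrapper (`cuda::barrier` from `<cuda/barrier>`), the same split looks roughly like the sketch below; `wait()` blocks for you, falling back to polling on hardware without native support.

```cpp
#include <cuda/barrier>

__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);  // friend function provided by cuda::barrier
    }
    __syncthreads();  // make the initialized barrier visible to all threads

    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }

    auto token = bar.arrive();   // "Option 1" position: non-blocking arrival

    // ... a lot of code that doesn't use `shared_data` ...

    bar.wait(std::move(token));  // "Option 2" position: block until all arrived

    // ... code that uses `shared_data` ...
}
```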
Note that this functionality is relatively new, with Compute Capability at least 7.0 required, and 8.0 or later recommended.
Upvotes: 3