Farzad

Reputation: 3438

Do warp vote functions synchronize threads in the warp?

Do CUDA warp vote functions, such as __any() and __all(), synchronize threads in the warp?

In other words, is there any guarantee that all threads in the warp have executed the instructions preceding the warp vote function, especially the instruction(s) that set the predicate?

Upvotes: 0

Views: 938

Answers (2)

Michael Haidl

Reputation: 5482

They don't. You can use warp vote functions within code branches. If they synchronized in such a case, a deadlock would be possible. From the PTX ISA:

vote

Vote across thread group.

Syntax

 vote.mode.pred  d, {!}a;
 vote.ballot.b32 d, {!}a;  // 'ballot' form, returns bitmask

 .mode = { .all, .any, .uni };

Description

Performs a reduction of the source predicate across threads in a warp. The destination predicate value is the same across all threads in the warp. The reduction modes are:

.all True if source predicate is True for all active threads in warp. Negate the source predicate to compute .none.

.any True if source predicate is True for some active thread in warp. Negate the source predicate to compute .not_all.

.uni True if source predicate has the same value in all active threads in warp. Negating the source predicate also computes .uni.

In the ballot form, vote.ballot.b32 simply copies the predicate from each thread in a warp into the corresponding bit position of destination register d, where the bit position corresponds to the thread's lane id.
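To make the relation between the ballot form and the three reduction modes concrete, here is a minimal sketch. It assumes CUDA 9 or later, where the `_sync` intrinsics take an explicit lane mask and stand in for the legacy `__ballot()`. The kernel name `ballot_reductions`, the struct `Reductions`, the helper `run_ballot_demo` and the predicate "lane id below a threshold" are all made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result record, written by lane 0 only.
struct Reductions { unsigned ballot; int count; int any; int all; int uni; };

// Launch with exactly one full warp (32 threads) so the full mask is valid.
__global__ void ballot_reductions(Reductions *out, int threshold)
{
    const unsigned full = 0xFFFFFFFFu;
    unsigned lane = threadIdx.x % warpSize;
    int pred = (int)lane < threshold;          // illustrative predicate

    // Bit i of b holds the predicate of lane i.
    unsigned b = __ballot_sync(full, pred);

    if (lane == 0) {
        out->ballot = b;
        out->count  = __popc(b);               // number of lanes voting true
        out->any    = (b != 0);                // same answer as .any
        out->all    = (b == full);             // same answer as .all
        out->uni    = (b == 0 || b == full);   // same answer as .uni
    }
}

// Host helper: runs one warp and copies the result back.
Reductions run_ballot_demo(int threshold)
{
    Reductions h = {};
    Reductions *d = nullptr;
    cudaMalloc(&d, sizeof(Reductions));
    cudaMemset(d, 0, sizeof(Reductions));
    ballot_reductions<<<1, 32>>>(d, threshold);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h, d, sizeof(Reductions), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}

int main()
{
    Reductions r = run_ballot_demo(5);
    printf("ballot=0x%08X count=%d any=%d all=%d uni=%d\n",
           r.ballot, r.count, r.any, r.all, r.uni);
    return 0;
}
```

With a threshold of 5, lanes 0 to 4 vote true, so this should report a ballot of 0x0000001F and a count of 5, with any true and both all and uni false.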

EDIT: Since threads within a warp are implicitly synchronized, you don't have to manually ensure that they are synchronized when the vote takes place. Note that for __all only the active threads participate in the vote. Active threads are the threads currently executing the instruction, i.e. those for which the enclosing branch condition is true. This explains why a vote can occur within code branches.
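A vote inside a branch can be sketched as follows. This is only an illustration, assuming CUDA 9 or later, where `__any_sync()`, `__all_sync()` and `__ballot_sync()` take an explicit mask of participating lanes in place of the legacy `__any()` and `__all()`. The names `vote_in_branch`, `VoteResult` and `run_vote_demo`, and the predicate "lane id is even", are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result record, written by lane 0 only.
struct VoteResult { unsigned ballot; int any; int all; };

// Launch with one warp of 32 threads. Only lanes 0..15 enter the branch,
// so only those lanes take part in the votes inside it.
__global__ void vote_in_branch(VoteResult *out)
{
    unsigned lane = threadIdx.x % warpSize;
    int pred = (lane % 2 == 0);              // illustrative predicate: even lane id

    if (lane < 16) {
        // The mask names exactly the lanes that take this branch.
        const unsigned mask = 0x0000FFFFu;
        unsigned b = __ballot_sync(mask, pred);
        int any = __any_sync(mask, pred);
        int all = __all_sync(mask, pred);
        if (lane == 0) {
            out->ballot = b;
            out->any = (any != 0);
            out->all = (all != 0);
        }
    }
    // Lanes 16..31 never reach the votes, and the kernel still completes.
}

// Host helper: runs one warp and copies the result back.
VoteResult run_vote_demo()
{
    VoteResult h = {};
    VoteResult *d = nullptr;
    cudaMalloc(&d, sizeof(VoteResult));
    cudaMemset(d, 0, sizeof(VoteResult));
    vote_in_branch<<<1, 32>>>(d);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h, d, sizeof(VoteResult), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}

int main()
{
    VoteResult r = run_vote_demo();
    printf("ballot=0x%08X any=%d all=%d\n", r.ballot, r.any, r.all);
    return 0;
}
```

This should print `ballot=0x00005555 any=1 all=0`. Bits 16 to 31 stay zero because those lanes are not part of the vote, and the all vote fails only because the odd lanes among 0..15 voted false.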

Upvotes: -1

ArchaeaSoftware

Reputation: 4422

The synchronization is implicit, since threads within a warp execute in lockstep. [*]

Code that relies on this behavior is known as "warp synchronous."

[*] If you are thinking that conditional code will cause threads within a warp to follow different execution paths, you have more to learn about how CUDA hardware works. Divergent conditional code (i.e. conditional code where the condition is true for some threads but not for others) causes certain threads within the warp to be disabled (either by predication or the branch synchronization stack), but each thread still occupies one of the 32 lanes available in the warp.

Upvotes: 3
