Reputation: 629
The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".
My question is: what applications will use these 3 functions?
Upvotes: 11
Views: 8973
Reputation: 7482
As an example of algorithm that uses __ballot API i would mention the In-Kernel Stream Compaction by D.M Hughes et Al. It is used in prefix sum part of the stream compaction to count (per warp) the number of elements that passed the predicate.
Here the paper. In-k Stream Compaction
Upvotes: 1
Reputation: 481
CUDA provides several warp-wide broadcast and reduction operations that NVIDIA’s architectures efficiently support. For example, __ballot(predicate) instruction evaluates predicate for all active threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active [Reference: Flexible Software Profiling of GPU Architectures].
Upvotes: 0
Reputation: 21475
The prototype of __ballot
is the following
unsigned int __ballot(int predicate);
If predicate
is nonzero, __ballot
returns a value with the N
th bit set, where N
is the thread index.
Combined with atomicOr
and __popc
, it can be used to accumulate the number of threads in each warp having a true predicate.
Indeed, the prototype of atomicOr
is
int atomicOr(int* address, int val);
and atomicOr
reads the value pointed to by address
, performs a bitwise OR
operation with val
, and writes the value back to address
and returns its old value as a return parameter.
On the other side, __popc
returns the number of bits set withing a 32
-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
const u32 warp_sum = threadIdx.x >> 5;
atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));
atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann
Upvotes: 8
Reputation: 5154
__ballot
is used in CUDA-histogram and in CUDA NPP library for quick generation of bitmasks, and combining it with __popc
intrinsic to make a very efficient implementation of boolean reduction.
__all
and __any
was used in reduction before introduction of __ballot
, though I can not think of any other use of them.
Upvotes: 5