Fan Zhang
Fan Zhang

Reputation: 629

About warp voting function

The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".

My question is: what applications will use these 3 functions?

Upvotes: 11

Views: 8973

Answers (4)

Davide Spataro
Davide Spataro

Reputation: 7482

As an example of algorithm that uses __ballot API i would mention the In-Kernel Stream Compaction by D.M Hughes et Al. It is used in prefix sum part of the stream compaction to count (per warp) the number of elements that passed the predicate.

Here the paper. In-k Stream Compaction

Upvotes: 1

husin alhaj ahmade
husin alhaj ahmade

Reputation: 481

CUDA provides several warp-wide broadcast and reduction operations that NVIDIA’s architectures efficiently support. For example, __ballot(predicate) instruction evaluates predicate for all active threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active [Reference: Flexible Software Profiling of GPU Architectures].

Upvotes: 0

Vitality
Vitality

Reputation: 21475

The prototype of __ballot is the following

unsigned int __ballot(int predicate);

If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the thread index.

Combined with atomicOr and __popc, it can be used to accumulate the number of threads in each warp having a true predicate.

Indeed, the prototype of atomicOr is

int atomicOr(int* address, int val);

and atomicOr reads the value pointed to by address, performs a bitwise OR operation with val, and writes the value back to address and returns its old value as a return parameter.

On the other side, __popc returns the number of bits set withing a 32-bit parameter.

Accordingly, the instructions

volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];

const u32 warp_sum = threadIdx.x >> 5;

atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));

atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));

can be used to count the number of threads for which the predicate is true.

For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann

Upvotes: 8

aland
aland

Reputation: 5154

__ballot is used in CUDA-histogram and in CUDA NPP library for quick generation of bitmasks, and combining it with __popc intrinsic to make a very efficient implementation of boolean reduction.

__all and __any was used in reduction before introduction of __ballot, though I can not think of any other use of them.

Upvotes: 5

Related Questions