CUDA Block-wide Parallel Primitives

Question

Are there any built-in CUDA kernel functions that are equivalent to OpenCL 2.0 work_group_* functions? I'm specifically looking for work_group_scan_exclusive_add and work_group_reduce_add. My naive implementations of these operations do not perform as well as OpenCL's built-in functions, and I expect that an implementation using __shfl could be used to speed things up with CUDA.

Robert Crovella · Accepted Answer

CUDA itself does not provide this functionality.

The CUB library was built with this purpose in mind.

Its block-level primitives, which include cub::BlockReduce and cub::BlockScan, are summarized in the CUB documentation.
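As a rough sketch of how CUB's block-level primitives could stand in for work_group_reduce_add and work_group_scan_exclusive_add: the kernel name, the block size of 128, and the test data below are illustrative assumptions, not part of the original question. Note one difference from OpenCL: cub::BlockReduce returns a valid sum only in thread 0, whereas the OpenCL built-in returns it to every work-item.

```cuda
#include <cstdio>
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS = 128;  // illustrative block size (assumption)

// Hypothetical kernel: each block computes the sum of its inputs and an
// exclusive prefix sum across its threads.
__global__ void blockPrimitives(const int *in, int *scanOut, int *blockSums)
{
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
    using BlockScan   = cub::BlockScan<int, BLOCK_THREADS>;

    // Each primitive needs its own shared-memory scratch space.
    __shared__ typename BlockReduce::TempStorage reduceStorage;
    __shared__ typename BlockScan::TempStorage   scanStorage;

    int idx   = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int value = in[idx];

    // Rough analogue of work_group_reduce_add; result is valid in thread 0 only.
    int sum = BlockReduce(reduceStorage).Sum(value);

    // Rough analogue of work_group_scan_exclusive_add.
    int prefix;
    BlockScan(scanStorage).ExclusiveSum(value, prefix);

    scanOut[idx] = prefix;
    if (threadIdx.x == 0) blockSums[blockIdx.x] = sum;
}

int main()
{
    const int numBlocks = 2;
    const int n = numBlocks * BLOCK_THREADS;

    int *in, *scanOut, *blockSums;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&scanOut, n * sizeof(int));
    cudaMallocManaged(&blockSums, numBlocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;  // all ones, so results are easy to check

    blockPrimitives<<<numBlocks, BLOCK_THREADS>>>(in, scanOut, blockSums);
    cudaDeviceSynchronize();

    // With all-ones input, each block sum equals the block size and each
    // exclusive prefix equals the thread's index within its block.
    bool ok = true;
    for (int b = 0; b < numBlocks; ++b) ok = ok && (blockSums[b] == BLOCK_THREADS);
    for (int i = 0; i < n; ++i)         ok = ok && (scanOut[i] == i % BLOCK_THREADS);
    printf("%s\n", ok ? "results match" : "results differ");

    cudaFree(in);
    cudaFree(scanOut);
    cudaFree(blockSums);
    return ok ? 0 : 1;
}
```

The block size is a template parameter, so the kernel must be launched with exactly that many threads per block. If every thread needs the reduced value, thread 0 can write it to shared memory and the block can read it back after a __syncthreads().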

This page has reference code for implementation of a block reduce.
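For readers who want to see the shuffle-based approach the question alludes to, here is a minimal sketch of a block reduce built on warp shuffles. It is not the reference code from the linked page; the function and kernel names are hypothetical. It assumes a one-dimensional block whose size is a multiple of 32, and it uses __shfl_down_sync (CUDA 9 and later) in place of the older __shfl intrinsics, which are deprecated.

```cuda
#include <cstdio>

// Sum a value across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ int warpReduceSum(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Sum a value across the block; the result is valid in thread 0 only.
// Assumes blockDim.x is a multiple of warpSize (and at most 1024).
__inline__ __device__ int blockReduceSum(int val)
{
    __shared__ int warpSums[32];          // one partial sum per warp
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    val = warpReduceSum(val);             // step 1: reduce within each warp
    if (lane == 0) warpSums[warp] = val;  // step 2: publish per-warp partials
    __syncthreads();

    // Step 3: the first warp reduces the per-warp partials.
    int numWarps = blockDim.x / warpSize;
    val = (threadIdx.x < numWarps) ? warpSums[lane] : 0;
    if (warp == 0) val = warpReduceSum(val);
    return val;
}

// Hypothetical test kernel: one sum per block.
__global__ void sumPerBlock(const int *in, int *out)
{
    int sum = blockReduceSum(in[blockIdx.x * blockDim.x + threadIdx.x]);
    if (threadIdx.x == 0) out[blockIdx.x] = sum;
}

int main()
{
    const int threads = 256, blocks = 4, n = threads * blocks;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, blocks * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;  // all ones: each block should sum to `threads`

    sumPerBlock<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();

    bool ok = true;
    for (int b = 0; b < blocks; ++b) ok = ok && (out[b] == threads);
    printf("%s\n", ok ? "results match" : "results differ");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

The shuffle steps avoid shared-memory traffic inside each warp, so shared memory is only touched once per warp. In practice CUB's tuned implementations are usually the safer choice; a hand-written version like this is mainly useful for understanding the technique.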
