Reputation: 516
Are there any built-in CUDA kernel functions that are equivalent to OpenCL 2.0 work_group_* functions? I'm specifically looking for work_group_scan_exclusive_add and work_group_reduce_add. My naive implementations of these operations do not perform as well as OpenCL's built-in functions, and I expect that an implementation using __shfl could be used to speed things up with CUDA.
Upvotes: 0
Views: 153
Reputation: 151924
CUDA itself does not provide built-in equivalents of the OpenCL 2.0 work_group_* functions.
The CUB library was built with this purpose in mind.
Its block-level primitives, which include cub::BlockReduce and cub::BlockScan, are summarized in the CUB documentation.
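As a rough illustration of how the CUB block-level primitives map onto work_group_scan_exclusive_add and work_group_reduce_add, here is a minimal sketch. The kernel name, the host helper run_block_primitives, and the fixed block size of 128 are assumptions made for this example and are not part of CUB. It launches a single block and has each thread contribute one int.

```cuda
#include <cub/cub.cuh>
#include <cassert>
#include <cstdio>

constexpr int BLOCK_THREADS = 128;  // assumed block size for this sketch

// Hypothetical kernel: one block, one int per thread.
__global__ void block_scan_and_reduce(const int* in, int* scan_out, int* sum_out)
{
    using BlockScan   = cub::BlockScan<int, BLOCK_THREADS>;
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;

    // Each primitive needs its own shared-memory scratch space.
    __shared__ typename BlockScan::TempStorage   scan_storage;
    __shared__ typename BlockReduce::TempStorage reduce_storage;

    int x = in[threadIdx.x];

    // Rough counterpart of work_group_scan_exclusive_add(x).
    int prefix;
    BlockScan(scan_storage).ExclusiveSum(x, prefix);
    scan_out[threadIdx.x] = prefix;

    // Rough counterpart of work_group_reduce_add(x).
    // Unlike OpenCL, the total is only valid in thread 0.
    int total = BlockReduce(reduce_storage).Sum(x);
    if (threadIdx.x == 0) *sum_out = total;
}

// Hypothetical host helper: copies data in, runs one block, copies results out.
int run_block_primitives(const int* h_in, int* h_scan)
{
    int *d_in, *d_scan, *d_sum, h_sum = 0;
    const size_t bytes = BLOCK_THREADS * sizeof(int);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_scan, bytes);
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    block_scan_and_reduce<<<1, BLOCK_THREADS>>>(d_in, d_scan, d_sum);

    cudaMemcpy(h_scan, d_scan, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_scan); cudaFree(d_sum);
    return h_sum;
}

int main()
{
    int h_in[BLOCK_THREADS], h_scan[BLOCK_THREADS];
    for (int i = 0; i < BLOCK_THREADS; ++i) h_in[i] = 1;

    int sum = run_block_primitives(h_in, h_scan);
    assert(sum == BLOCK_THREADS);  // 128 ones add up to 128
    assert(h_scan[5] == 5);        // exclusive prefix of thread 5
    printf("sum = %d, scan[5] = %d\n", sum, h_scan[5]);
    return 0;
}
```

One difference from the OpenCL functions worth keeping in mind: work_group_reduce_add returns the result to every work-item, while BlockReduce::Sum only guarantees it in thread 0, so a broadcast through shared memory is needed if all threads want the total.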
This page has reference code for implementing a block reduce.
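For the shuffle-based approach the question has in mind, a hand-written block reduce might look like the sketch below. It is not the reference code from the linked page, only an illustration of the usual pattern: reduce within each warp using shuffles, stage one partial sum per warp in shared memory, then let the first warp reduce those. The function names are made up for this example, and it assumes a 1D block whose size is a multiple of 32. It also uses __shfl_down_sync, since the plain __shfl family was superseded by the *_sync variants starting with CUDA 9.

```cuda
#include <cassert>
#include <cstdio>

constexpr unsigned FULL_MASK = 0xffffffffu;  // all 32 lanes participate

// Warp-level sum: after the loop, lane 0 holds the warp total.
__inline__ __device__ int warp_reduce_add(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;
}

// Block-level sum. Assumes a 1D block with blockDim.x a multiple of 32.
// The result is valid in thread 0 only.
__inline__ __device__ int block_reduce_add(int val)
{
    __shared__ int warp_sums[32];  // one slot per warp (at most 32 warps per block)
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;

    val = warp_reduce_add(val);             // step 1: reduce inside each warp
    if (lane == 0) warp_sums[warp] = val;   // step 2: stage per-warp partial sums
    __syncthreads();

    const int num_warps = blockDim.x / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[lane] : 0;
    if (warp == 0) val = warp_reduce_add(val);  // step 3: first warp finishes
    return val;
}

__global__ void sum_kernel(const int* in, int* out)
{
    int total = block_reduce_add(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

// Hypothetical host helper: sums n ints with a single block of n threads.
int block_sum_on_gpu(const int* h_in, int n)
{
    int *d_in, *d_out, h_out = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    sum_kernel<<<1, n>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}

int main()
{
    const int n = 256;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int sum = block_sum_on_gpu(h_in, n);
    assert(sum == 32640);  // 0 + 1 + ... + 255
    printf("sum = %d\n", sum);
    return 0;
}
```

An exclusive scan can be built along the same lines with __shfl_up_sync, but handling the cross-warp carry correctly is fiddlier, which is a good reason to reach for cub::BlockScan unless there is a specific need to hand-roll it.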
Upvotes: 2