user1034772

Reputation: 516

CUDA Block-wide Parallel Primitives

Are there any built-in CUDA kernel functions that are equivalent to OpenCL 2.0 work_group_* functions? I'm specifically looking for work_group_scan_exclusive_add and work_group_reduce_add. My naive implementations of these operations do not perform as well as OpenCL's built-in functions, and I expect that an implementation using __shfl could be used to speed things up with CUDA.

Upvotes: 0

Views: 153

Answers (1)

Robert Crovella

Reputation: 151924

CUDA itself does not provide this functionality.

The CUB library was built with this purpose in mind.

The block-level primitives, such as cub::BlockReduce and cub::BlockScan, are summarized in the CUB documentation.
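As a rough illustration of how the two OpenCL functions from the question could map onto CUB, here is a minimal sketch using cub::BlockScan (ExclusiveSum) and cub::BlockReduce (Sum). The block size of 128, the kernel name block_scan_and_reduce, and the host helper count_mismatches are assumptions made for this example, not anything from the question. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS = 128;  // assumed block size for this sketch

// One block computes an exclusive prefix sum and a total over BLOCK_THREADS ints.
__global__ void block_scan_and_reduce(const int* in, int* scan_out, int* sum_out)
{
    using BlockScan   = cub::BlockScan<int, BLOCK_THREADS>;
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;

    // The primitives run one after the other, so their scratch space can overlap.
    __shared__ union {
        typename BlockScan::TempStorage   scan;
        typename BlockReduce::TempStorage reduce;
    } temp;

    int value = in[threadIdx.x];

    // Rough counterpart of work_group_scan_exclusive_add.
    int prefix;
    BlockScan(temp.scan).ExclusiveSum(value, prefix);
    scan_out[threadIdx.x] = prefix;

    __syncthreads();  // needed before the shared scratch space is reused

    // Rough counterpart of work_group_reduce_add; the result is only valid in thread 0.
    int total = BlockReduce(temp.reduce).Sum(value);
    if (threadIdx.x == 0) *sum_out = total;
}

// Hypothetical host helper: runs one block and counts mismatches against a CPU reference.
int count_mismatches()
{
    std::vector<int> h_in(BLOCK_THREADS), h_scan(BLOCK_THREADS);
    for (int i = 0; i < BLOCK_THREADS; ++i) h_in[i] = i + 1;

    int *d_in, *d_scan, *d_sum;
    cudaMalloc(&d_in, BLOCK_THREADS * sizeof(int));
    cudaMalloc(&d_scan, BLOCK_THREADS * sizeof(int));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), BLOCK_THREADS * sizeof(int), cudaMemcpyHostToDevice);

    block_scan_and_reduce<<<1, BLOCK_THREADS>>>(d_in, d_scan, d_sum);

    int h_sum = 0;
    cudaMemcpy(h_scan.data(), d_scan, BLOCK_THREADS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_scan);
    cudaFree(d_sum);

    // CPU reference: running sum before each element, then the grand total.
    int mismatches = 0, running = 0;
    for (int i = 0; i < BLOCK_THREADS; ++i) {
        if (h_scan[i] != running) ++mismatches;
        running += h_in[i];
    }
    if (h_sum != running) ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

Unlike the OpenCL built-ins, BlockReduce only returns the sum to thread 0, so a broadcast through shared memory would be needed if every thread requires the total. Recent CUDA Toolkits bundle CUB, so this should build with a plain nvcc invocation.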

This page has reference code for implementation of a block reduce.
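For readers who want to see the shuffle-based idea the question hints at, the following is a minimal sketch of a block-wide sum built on warp shuffles. It is not the code from the page referenced above. It uses __shfl_down_sync, the synchronized form that replaced the plain __shfl intrinsics from CUDA 9 onward, and it assumes a one-dimensional block whose size is a multiple of 32. The names warp_reduce_sum, block_reduce_sum, sum_kernel, and run_block_sum are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sum across the 32 lanes of a warp; lane 0 ends up with the warp total.
__inline__ __device__ int warp_reduce_sum(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Block-wide sum; the result is only valid in thread 0.
// Assumes blockDim.x is a multiple of 32 and the block is one-dimensional.
__inline__ __device__ int block_reduce_sum(int val)
{
    __shared__ int warp_sums[32];  // one slot per warp (at most 32 warps per block)
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);            // step 1: reduce inside each warp
    if (lane == 0) warp_sums[warp] = val;  // step 2: publish per-warp totals
    __syncthreads();

    // Step 3: the first warp reduces the per-warp totals.
    int num_warps = blockDim.x / warpSize;
    val = (threadIdx.x < num_warps) ? warp_sums[lane] : 0;
    if (warp == 0) val = warp_reduce_sum(val);
    return val;
}

__global__ void sum_kernel(const int* in, int* out)
{
    int total = block_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

// Hypothetical host helper: sums 1..256 with a single block of 256 threads.
int run_block_sum()
{
    const int n = 256;
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = i + 1;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sum_kernel<<<1, n>>>(d_in, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("block sum: %d\n", run_block_sum());
    return 0;
}
```

If block_reduce_sum is called more than once in the same kernel, a __syncthreads() between calls would be needed so that warp_sums is not overwritten while other warps are still reading it. In practice the CUB primitives handle these details and are likely the better default.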

Upvotes: 2
