
How do I do the converse of shfl.idx (i.e. warp scatter instead of warp gather)?

With the shfl.idx instruction (shfl.sync.idx in current PTX, exposed in CUDA C++ as __shfl_sync()), we perform what is essentially an intra-warp gather: each lane provides a datum and an origin lane, and gets the datum of the origin lane.
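For reference, the gather can be sketched as below. This is a minimal illustration, assuming a single fully active warp of 32 lanes; the names warp_gather_kernel and run_warp_gather are made up for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each lane offers data[lane] and reads the datum held by lane origin[lane].
__global__ void warp_gather_kernel(const int* data, const int* origin, int* out)
{
    const unsigned full_mask = 0xffffffffu;  // assumption: all 32 lanes participate
    const int lane = threadIdx.x % 32;
    out[lane] = __shfl_sync(full_mask, data[lane], origin[lane]);
}

// Hypothetical host driver: launches one warp and returns what each lane received.
std::vector<int> run_warp_gather(const std::vector<int>& data,
                                 const std::vector<int>& origin)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_data, *d_origin, *d_out;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_origin, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_data, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_origin, origin.data(), bytes, cudaMemcpyHostToDevice);

    warp_gather_kernel<<<1, 32>>>(d_data, d_origin, d_out);

    std::vector<int> out(32);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_data);
    cudaFree(d_origin);
    cudaFree(d_out);
    return out;
}
```

Several lanes may name the same origin lane here, which is exactly what makes the gather well defined and the scatter not.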

What about the converse operation, a scatter? I don't mean scattering to memory, but to lanes. That is, each lane provides a datum and a destination lane; a lane targeted by exactly one lane ends up with the targeting lane's value, and every other lane ends up with an undefined/arbitrary value.

I'm pretty sure PTX doesn't have something like this. Does it perhaps exist in SASS somehow? If not, is there a better way of implementing this than, say, scattering to shared memory and loading from shared memory, both by lane index?

Upvotes: 2

Answers (2)

einpoklum

Reputation: 132108

As things stand in today's GPUs (Hopper and earlier), you just don't: there is no hardware support for inter-lane scattering.

So, maybe just do this the straightforward way - via shared memory:

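1. Get a shared memory scratch buffer of 32 elements (one buffer per warp, if several warps do this at once).
2. Have each lane write its datum to the buffer, at an offset equal to its intended destination lane.
3. Synchronize. __syncwarp() is enough, since only the lanes of one warp exchange data; __syncthreads() also works, provided every thread of the block reaches it.
4. Have each lane read the buffer element at its own lane index.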

Assuming no two lanes write to the same place (otherwise the scatter itself would have an undefined result), this will work, at the cost of two shared memory operations and the sync.
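The steps above can be sketched as follows. This is only an illustration under stated assumptions: the block holds a single, fully active warp; the names warp_scatter_smem_kernel and run_warp_scatter_smem are invented for the example; and it synchronizes with __syncwarp() rather than __syncthreads(), since only one warp is involved. Error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each lane sends data[lane] to lane dest[lane] through shared memory.
__global__ void warp_scatter_smem_kernel(const int* data, const int* dest, int* out)
{
    __shared__ int scratch[32];         // one slot per lane; assumes one warp per block
    const int lane = threadIdx.x % 32;

    scratch[dest[lane]] = data[lane];   // write at the destination lane's slot
    __syncwarp();                       // make the writes visible to the other lanes
    out[lane] = scratch[lane];          // untargeted lanes read an arbitrary value
}

// Hypothetical host driver: launches one warp and returns what each lane received.
std::vector<int> run_warp_scatter_smem(const std::vector<int>& data,
                                       const std::vector<int>& dest)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_data, *d_dest, *d_out;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_dest, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_data, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dest, dest.data(), bytes, cudaMemcpyHostToDevice);

    warp_scatter_smem_kernel<<<1, 32>>>(d_data, d_dest, d_out);

    std::vector<int> out(32);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_data);
    cudaFree(d_dest);
    cudaFree(d_out);
    return out;
}
```

With more than one warp per block, each warp would need its own 32-element slice of the scratch buffer, indexed by warp number.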

Upvotes: 1

Chris Kitching

Reputation: 2655

The shuffle operations are all defined in terms of the lane to read from. The CUDA functions map almost directly to the PTX instructions, which themselves map almost directly to the SASS. They're all variations on the operation "Make this value available for others to read, and read the value from the given source lane", with various convenient ways to specify the source lane.

In general, you should attempt to rejig your function so you don't need the "scatter" operation. There isn't an instruction that does what you want.

Implementing this using the existing warp intrinsics is probably possible, but not obvious. You could use a sequence of shuffles similar to what you'd use for a warp reduction to transmit source lane IDs, and follow up with a final shuffle to fetch the payloads into the needed lanes.
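One possible way to realize that idea is sketched below. Note that it swaps the suggested sequence of shuffles for five __ballot_sync() votes, one per bit of the destination index, so that each lane can work out which lane (if any) targets it; a single __shfl_sync() then fetches the payload. It assumes a fully active warp and destinations in the range 0..31. The function and driver names are made up for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Returns the value sent to the calling lane. A lane targeted by zero or by
// several lanes simply gets its own value back (the "arbitrary" case).
__device__ int warp_scatter_ballot(int value, int dest)
{
    const unsigned full_mask = 0xffffffffu;  // assumption: all 32 lanes participate
    const int lane = threadIdx.x % 32;

    unsigned candidates = full_mask;         // lanes that might be targeting me
    for (int bit = 0; bit < 5; ++bit) {
        // votes has bit s set iff lane s's destination has this bit set
        const unsigned votes = __ballot_sync(full_mask, (dest >> bit) & 1);
        candidates &= ((lane >> bit) & 1) ? votes : ~votes;
    }
    // candidates now holds exactly the lanes whose dest equals my lane index
    const int src = (__popc(candidates) == 1) ? (__ffs((int)candidates) - 1) : lane;
    return __shfl_sync(full_mask, value, src);
}

__global__ void warp_scatter_ballot_kernel(const int* data, const int* dest, int* out)
{
    const int lane = threadIdx.x % 32;
    out[lane] = warp_scatter_ballot(data[lane], dest[lane]);
}

// Hypothetical host driver: launches one warp and returns what each lane received.
std::vector<int> run_warp_scatter_ballot(const std::vector<int>& data,
                                         const std::vector<int>& dest)
{
    const size_t bytes = 32 * sizeof(int);
    int *d_data, *d_dest, *d_out;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_dest, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_data, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dest, dest.data(), bytes, cudaMemcpyHostToDevice);

    warp_scatter_ballot_kernel<<<1, 32>>>(d_data, d_dest, d_out);

    std::vector<int> out(32);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_data);
    cudaFree(d_dest);
    cudaFree(d_out);
    return out;
}
```

This costs five ballots, a handful of bitwise operations and one shuffle per scatter; whether that beats the shared memory round trip would have to be measured on the target GPU.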

Upvotes: 2
