Reputation: 163
Sorting data in each segment in an array on GPU, the size of segment is 32, and there are no sorting or merging further for different segments. So I load the data of the each segment into the shared memory from global memory, and store the data into the global memory after I finished sorting of each segment. What's the parallel algorithm is prefer for higher throughput?
Upvotes: 0
Views: 427
Reputation: 131960
I suggest you use the in-warp bitonic sort, which is implemented very efficiently using the SHFL instruction of the Kepler architecture. See code in this GTC 2013 presentation:
Kepler's SHUFFLE(SHFL) Instruction: Tips and Tricks
Using it will also mean you won't need to bother with shared memory, just load one value by each thread into a register.
Upvotes: 2
Reputation: 3438
Since segment sizes are all 32, I personally suggest merge sort. There's also this paper you can refer to.
Upvotes: 1