WRF

Reputation: 137

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.

All those output pairs eventually need to be stored in one array, since all the keys in each bin will later be processed together. How can I do this efficiently?

Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to initialize a hash table as an array of pointers to some sort of per-bin storage. That looks even slower.

I'm thinking of pre-reserving 2 pairs per input record in a block-shared array, then grabbing more space as needed (i.e., reimplementing the STL vector's reserve operation), then having the last thread in each block copy the block-shared array to global memory.
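Roughly what I have in mind, as an untested sketch (num_bins and bin_of are placeholders for my real hashing, and for brevity a shared counter hands out slots instead of pre-reserving exactly 2 per thread):

    #define TPB  256                        // threads per block
    #define SBUF (TPB * 4)                  // staged pairs per block (guess)

    struct Pair { unsigned int bin; unsigned long long key; };

    __device__ int num_bins(unsigned long long k)        // placeholder: usually 2
    { return 2 + (int)(k % 3); }

    __device__ unsigned int bin_of(unsigned long long k, int j)  // placeholder hash
    { return (unsigned int)((k >> (8 * j)) & 0xffff); }

    __global__ void stage_and_flush(const unsigned long long *keys, int n,
                                    Pair *g_out, unsigned int *g_next)
    {
        __shared__ Pair s_buf[SBUF];
        __shared__ unsigned int s_count;    // pairs staged by this block
        __shared__ unsigned int s_base;     // block's reserved range in g_out
        if (threadIdx.x == 0) s_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned long long k = keys[i];
            int nb = num_bins(k);
            unsigned int slot = atomicAdd(&s_count, (unsigned int)nb);
            for (int j = 0; j < nb && slot + j < (unsigned)SBUF; ++j) {
                s_buf[slot + j].bin = bin_of(k, j);   // overflow is dropped here;
                s_buf[slot + j].key = k;              // the real thing would grab more
            }
        }
        __syncthreads();

        // One global atomic per block reserves a dense output range
        // (*g_next must be zeroed before launch)...
        unsigned int total = min(s_count, (unsigned int)SBUF);
        if (threadIdx.x == 0) s_base = atomicAdd(g_next, total);
        __syncthreads();

        // ...then all threads cooperatively copy the staged pairs out.
        for (unsigned int t = threadIdx.x; t < total; t += blockDim.x)
            g_out[s_base + t] = s_buf[t];
    }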

However, I'm not looking forward to implementing that. Help? Thanks.

Upvotes: 5

Views: 1055

Answers (1)

Christian Sarofeen

Reputation: 2250

Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow.

You could reserve chunks of a global array instead of one entry at a time. In other words, with one large output array, each thread could start by claiming, say, 10 output entries. If a thread overflows its chunk, it requests the next available chunk from the global array with another atomic increment. If you're worried about contention on a single atomic counter, you could use 10 counters for 10 portions of the array and distribute the accesses across them; if one portion fills up, move to another.
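As a minimal sketch, assuming one global counter (CHUNK, g_next, and scatter_chunked are hypothetical names; Pair, num_bins, and bin_of as in the question's sketch):

    #define CHUNK 10                        // slots claimed per atomicAdd

    __global__ void scatter_chunked(const unsigned long long *keys, int n,
                                    Pair *g_out, unsigned int *g_next,
                                    unsigned int capacity)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned int base = atomicAdd(g_next, CHUNK);   // claim 10 slots up front
        unsigned int used = 0;

        unsigned long long k = keys[i];
        int nb = num_bins(k);
        for (int j = 0; j < nb; ++j) {
            if (used == CHUNK) {                        // chunk full: claim another
                base = atomicAdd(g_next, CHUNK);
                used = 0;
            }
            if (base + used < capacity) {               // don't run off the array
                g_out[base + used].bin = bin_of(k, j);
                g_out[base + used].key = k;
            }
            ++used;
        }
    }

Threads that emit fewer than CHUNK pairs leave gaps in their chunks, which is where the compaction issue mentioned at the end comes from. For the multi-counter variant, g_next would become a small array of counters indexed by something like blockIdx.x % 10.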

I'm also considering processing the data twice: the first time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again.

This is another valid method. The bottleneck there is calculating each thread's offset into the global array once you have the number of results per thread; I haven't figured out a reasonable parallel way to do that myself, though it looks like a job for a prefix sum (scan).
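A minimal sketch of the scan approach, assuming Thrust's exclusive_scan and a hypothetical counts vector filled by the first pass:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts[i] = number of output pairs for input record i (from pass 1).
    // offsets must already be sized to match. After the scan, record i's
    // pairs start at offsets[i], and the total output size is
    // offsets[n-1] + counts[n-1].
    void compute_offsets(const thrust::device_vector<unsigned int> &counts,
                         thrust::device_vector<unsigned int> &offsets)
    {
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }

With the offsets known, a second kernel can write every pair straight to its final position with no atomics at all.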

The last option I can think of would be to allocate a large array, distribute it across blocks, and use a shared atomic int as each block's counter (which helps with slow global atomics). If a block runs out of space, mark that it didn't finish and where it left off, and on your next kernel launch complete the work that wasn't finished.
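A minimal sketch of that option (SLICE, g_retry, and scatter_blocked are hypothetical names; instead of marking exactly where a block left off, this version queues overflowing records for the next launch):

    #define SLICE 4096                      // output slots per block (guess)

    __global__ void scatter_blocked(const unsigned long long *keys, int n,
                                    Pair *g_out,
                                    int *g_retry, unsigned int *g_retry_count)
    {
        __shared__ unsigned int s_used;     // slots used in this block's slice
        if (threadIdx.x == 0) s_used = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        Pair *slice = g_out + (size_t)blockIdx.x * SLICE;
        unsigned long long k = keys[i];
        int nb = num_bins(k);

        unsigned int slot = atomicAdd(&s_used, (unsigned int)nb);  // shared: cheap
        if (slot + (unsigned)nb <= (unsigned)SLICE) {
            for (int j = 0; j < nb; ++j) {
                slice[slot + j].bin = bin_of(k, j);
                slice[slot + j].key = k;
            }
        } else {
            // Slice full: queue this record so the next launch can finish it
            // (*g_retry_count zeroed before launch; g_retry sized for the worst case).
            unsigned int r = atomicAdd(g_retry_count, 1u);
            g_retry[r] = i;
        }
    }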

The downside of the distributed portions of global memory, of course, is what talonmies said: you need a gather or compaction pass to make the results dense.

Good luck!

Upvotes: 1
