WRF

Reputation: 137

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.

All those output pairs eventually need to be stored in one array, since all the keys in each bin will later be processed together. How can I do this efficiently?

Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to initialize a hash table as an array of pointers to some sort of per-bin storage. That looks even slower.

I'm thinking of pre-reserving 2 pairs per input record in a block-shared array, then grabbing more space as needed (i.e., reimplementing the STL vector's reserve operation), then having the last thread in each block copy the block-shared array to global memory.
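Roughly what I have in mind, as an untested sketch (num_bins and bin_of are placeholders for my real hashing, and for brevity a shared counter hands out slots instead of pre-reserving exactly 2 per thread):

    #define TPB  256                        // threads per block
    #define SBUF (TPB * 4)                  // staged pairs per block (guess)

    struct Pair { unsigned int bin; unsigned long long key; };

    __device__ int num_bins(unsigned long long k)        // placeholder: usually 2
    { return 2 + (int)(k % 3); }

    __device__ unsigned int bin_of(unsigned long long k, int j)  // placeholder hash
    { return (unsigned int)((k >> (8 * j)) & 0xffff); }

    __global__ void stage_and_flush(const unsigned long long *keys, int n,
                                    Pair *g_out, unsigned int *g_next)
    {
        __shared__ Pair s_buf[SBUF];
        __shared__ unsigned int s_count;    // pairs staged by this block
        __shared__ unsigned int s_base;     // block's reserved range in g_out
        if (threadIdx.x == 0) s_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned long long k = keys[i];
            int nb = num_bins(k);
            unsigned int slot = atomicAdd(&s_count, (unsigned int)nb);
            for (int j = 0; j < nb && slot + j < (unsigned)SBUF; ++j) {
                s_buf[slot + j].bin = bin_of(k, j);   // overflow is dropped here;
                s_buf[slot + j].key = k;              // the real thing would grab more
            }
        }
        __syncthreads();

        // One global atomic per block reserves a dense output range
        // (*g_next must be zeroed before launch)...
        unsigned int total = min(s_count, (unsigned int)SBUF);
        if (threadIdx.x == 0) s_base = atomicAdd(g_next, total);
        __syncthreads();

        // ...then all threads cooperatively copy the staged pairs out.
        for (unsigned int t = threadIdx.x; t < total; t += blockDim.x)
            g_out[s_base + t] = s_buf[t];
    }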

However, I'm not looking forward to implementing that. Help? Thanks.

Upvotes: 5

Views: 1055

Answers (1)

Christian Sarofeen

Reputation: 2250

Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow.

You could reserve chunks of a global array instead of one entry at a time. In other words, with one large output array, each thread could start by claiming, say, 10 output entries. If a thread overflows its chunk, it requests the next available chunk from the global array with another atomic increment. If you're worried about contention on a single atomic counter, you could use 10 counters for 10 portions of the array and distribute the accesses across them; if one portion fills up, move to another.
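As a minimal sketch, assuming one global counter (CHUNK, g_next, and scatter_chunked are hypothetical names; Pair, num_bins, and bin_of as in the question's sketch):

    #define CHUNK 10                        // slots claimed per atomicAdd

    __global__ void scatter_chunked(const unsigned long long *keys, int n,
                                    Pair *g_out, unsigned int *g_next,
                                    unsigned int capacity)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned int base = atomicAdd(g_next, CHUNK);   // claim 10 slots up front
        unsigned int used = 0;

        unsigned long long k = keys[i];
        int nb = num_bins(k);
        for (int j = 0; j < nb; ++j) {
            if (used == CHUNK) {                        // chunk full: claim another
                base = atomicAdd(g_next, CHUNK);
                used = 0;
            }
            if (base + used < capacity) {               // don't run off the array
                g_out[base + used].bin = bin_of(k, j);
                g_out[base + used].key = k;
            }
            ++used;
        }
    }

Threads that emit fewer than CHUNK pairs leave gaps in their chunks, which is where the compaction issue mentioned at the end comes from. For the multi-counter variant, g_next would become a small array of counters indexed by something like blockIdx.x % 10.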

I'm also considering processing the data twice: the first time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again.

This is another valid method. The bottleneck there is calculating each thread's offset into the global array once you have the number of results per thread; I haven't figured out a reasonable parallel way to do that myself, though it looks like a job for a prefix sum (scan).
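A minimal sketch of the scan approach, assuming Thrust's exclusive_scan and a hypothetical counts vector filled by the first pass:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts[i] = number of output pairs for input record i (from pass 1).
    // offsets must already be sized to match. After the scan, record i's
    // pairs start at offsets[i], and the total output size is
    // offsets[n-1] + counts[n-1].
    void compute_offsets(const thrust::device_vector<unsigned int> &counts,
                         thrust::device_vector<unsigned int> &offsets)
    {
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }

With the offsets known, a second kernel can write every pair straight to its final position with no atomics at all.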

The last option I can think of would be to allocate a large array, distribute it across blocks, and use a shared atomic int as each block's counter (which helps with slow global atomics). If a block runs out of space, mark that it didn't finish and where it left off, and on your next kernel launch complete the work that wasn't finished.
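A minimal sketch of that option (SLICE, g_retry, and scatter_blocked are hypothetical names; instead of marking exactly where a block left off, this version queues overflowing records for the next launch):

    #define SLICE 4096                      // output slots per block (guess)

    __global__ void scatter_blocked(const unsigned long long *keys, int n,
                                    Pair *g_out,
                                    int *g_retry, unsigned int *g_retry_count)
    {
        __shared__ unsigned int s_used;     // slots used in this block's slice
        if (threadIdx.x == 0) s_used = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        Pair *slice = g_out + (size_t)blockIdx.x * SLICE;
        unsigned long long k = keys[i];
        int nb = num_bins(k);

        unsigned int slot = atomicAdd(&s_used, (unsigned int)nb);  // shared: cheap
        if (slot + (unsigned)nb <= (unsigned)SLICE) {
            for (int j = 0; j < nb; ++j) {
                slice[slot + j].bin = bin_of(k, j);
                slice[slot + j].key = k;
            }
        } else {
            // Slice full: queue this record so the next launch can finish it
            // (*g_retry_count zeroed before launch; g_retry sized for the worst case).
            unsigned int r = atomicAdd(g_retry_count, 1u);
            g_retry[r] = i;
        }
    }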

The downside of the distributed portions of global memory, of course, is what talonmies said: you need a gather or compaction pass to make the results dense.

Good luck!

Upvotes: 1
