Reputation: 2928
I have a large set of data, represented as a 1D array. Each individual entry is around 50 bytes in size. I would like to partition this data over a given number of buckets on the GPU using CUDA.
To fit as much data as possible into a single batch processed on the GPU, I would like the buffer containing the bucket contents to be as small as possible; making copies of my input data is not viable given the memory constraints. Using integer indices that reference the large input buffer seems to be the way to go here.
A naive approach is to allocate a fixed number of these references per bucket. Unfortunately, the input data is not evenly distributed: some buckets may fill up and overflow, while others may not contain a single entry. For my application the buckets have a practical meaning, so entries that belong in a bucket really have to end up in that bucket.
The most memory-efficient approach I can think of is to allocate an array containing one index per bucket (the list heads), plus a second buffer of structs, each holding a reference to an entry in the large input buffer as well as the index of the next struct (thus forming a linked list of entries for each bucket).
This makes the total size of both buffers known beforehand, allowing for efficient allocation. Unfortunately, while constructing the buffer it is necessary to keep track of how much of it has been used so far. To avoid race conditions, that requires an atomic increment on a single variable.
This means many threads will have to wait on a single variable to allocate their next entry in a bucket's linked list, which I expect will bring performance to its knees in an instant.
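To make this concrete, here is roughly what that single-counter scheme would look like (names like `Entry`, `bucket_of`, `NUM_BUCKETS` and the node layout are placeholders, not code from my actual application):

```cuda
#include <cuda_runtime.h>

constexpr int NUM_BUCKETS = 1024;                  // placeholder value

struct Entry { unsigned char payload[50]; };       // ~50-byte input records

// Placeholder for however an entry is mapped to its bucket.
__device__ int bucket_of(const Entry &e) { return e.payload[0] % NUM_BUCKETS; }

struct ListNode {
    int item;   // index into the large input buffer
    int next;   // index of the next node in the same bucket, or -1
};

// head[b]    : index of the first node of bucket b, initialised to -1
// node_count : the single global allocation counter (the contention point)
__global__ void build_lists(const Entry *items, int num_items,
                            int *head, ListNode *nodes, int *node_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_items) return;

    int b = bucket_of(items[i]);
    int slot = atomicAdd(node_count, 1);   // every thread serialises here
    nodes[slot].item = i;
    // Push the new node onto the front of bucket b's list.
    nodes[slot].next = atomicExch(&head[b], slot);
}
```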
I therefore thought it would be better to keep the idea of using linked lists to define the contents of each bucket, but to expand each node to hold not one but multiple references into the large input buffer, reducing the number of atomic operations that need to be performed. This comes at the cost of some wasted memory when nodes are not entirely filled, though I think that should be acceptable.
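Roughly, the node layout would then look like this (`K` and `ChunkNode` are illustrative names; in the worst case each bucket wastes K - 1 unused slots in its last, partially filled node):

```cuda
constexpr int K = 8;   // references per node, tunable

struct ChunkNode {
    int items[K];   // up to K indices into the large input buffer
    int count;      // number of valid entries in items[]
    int next;       // index of the next node for this bucket, or -1
};
```

With this layout the shared allocation counter only needs to be bumped roughly once per K insertions instead of once per entry.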
I have two questions about this strategy:
Upvotes: 0
Views: 242
Reputation: 7245
You already have a one-to-one correspondence between the structs in your buffer and the original data items. So instead of allocating structs in the buffer via incrementing an atomic counter, you can use the index of the original item to locate the struct in the buffer.
If you index the buffer with the same index as your original data, you also don't need to keep pointers to the original data items, and the buffer of structs becomes a simple array of integers, each holding the index of the next entry in its bucket's linked list.
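A minimal sketch of that idea, reusing the placeholder `Entry` and `bucket_of` from the question's sketch (array names are illustrative): `head` holds one index per bucket, `next` holds one integer per input item, and each thread pushes its own item onto the front of its bucket's list with a single per-bucket `atomicExch`, so no global counter is needed at all.

```cuda
// head[b] : index of the most recently inserted item of bucket b (init to -1)
// next[i] : index of the item inserted before item i into the same bucket,
//           or -1 -- one int per input item, addressed by i itself
__global__ void build_bucket_lists(const Entry *items, int num_items,
                                   int *head, int *next)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_items) return;

    int b = bucket_of(items[i]);
    // Make item i the new head of bucket b and chain the previous head
    // behind it; contention is per bucket rather than on a global counter.
    next[i] = atomicExch(&head[b], i);
}
```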
Alternatively, a two-pass approach over the data works as well: in the first pass, just count the number of items in each bucket. An inclusive scan over those counts then gives each bucket a correctly sized region within a single output buffer, which the second pass fills.
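A sketch of this two-pass variant, again reusing the placeholder `Entry`, `bucket_of` and `NUM_BUCKETS` from above and using Thrust for the scan; I use the exclusive form of the scan here since it yields each bucket's start offset directly (the inclusive form gives the end offsets instead):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Pass 1: one atomicAdd per item, spread over NUM_BUCKETS counters instead
// of a single global counter.
__global__ void count_items(const Entry *items, int num_items, int *counts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_items) return;
    atomicAdd(&counts[bucket_of(items[i])], 1);
}

// Pass 2: cursors[] starts as a copy of the bucket offsets and is advanced
// atomically; bucket_indices ends up packed, contiguous per bucket.
__global__ void fill_buckets(const Entry *items, int num_items,
                             int *cursors, int *bucket_indices)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_items) return;
    int pos = atomicAdd(&cursors[bucket_of(items[i])], 1);
    bucket_indices[pos] = i;
}

void partition(const Entry *d_items, int num_items,
               int *d_bucket_indices /* num_items ints */)
{
    thrust::device_vector<int> counts(NUM_BUCKETS, 0);
    int threads = 256, blocks = (num_items + threads - 1) / threads;

    count_items<<<blocks, threads>>>(
        d_items, num_items, thrust::raw_pointer_cast(counts.data()));

    // Exclusive scan of the counts gives each bucket's start offset.
    thrust::device_vector<int> offsets(NUM_BUCKETS);
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    thrust::device_vector<int> cursors = offsets;   // copy, then advance
    fill_buckets<<<blocks, threads>>>(
        d_items, num_items, thrust::raw_pointer_cast(cursors.data()),
        d_bucket_indices);
    cudaDeviceSynchronize();
}
```

This keeps each bucket's indices contiguous in memory, which also makes the later per-bucket processing more cache-friendly than chasing linked-list nodes.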
Upvotes: 1