aatish

Reputation: 282

CUDA Thrust reduce_by_key using less memory

I am trying to reduce the memory required to compute reduce_by_key for my use case. I have a relatively small number of unique keys (around 100-150) compared to the number of values (around 16 million). The reduce by key example shows that the device_vectors allocated to contain the result is of the same size as that of the inputs. Is it always necessary to do so? Is it possible to only allocate as much memory as necessary to contain the correct output?

Upvotes: 2

Views: 604

Answers (1)

Drop

Reputation: 13013

The output size of a reduction depends on the input data, and it is typically not known before the reduction runs. Depending on your problem, however, you may know this size in advance.

A reasonable implementation needs the output to hold only as many elements as there are key spans (runs of consecutive equal keys), and thrust::reduce_by_key behaves this way.
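As a minimal sketch (with small illustrative data, not your 16-million-element input): thrust::reduce_by_key returns a pair of iterators pointing one past the last elements written, so you can allocate only an upper bound on the number of spans and then shrink to the actual count.

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main() {
    // Sorted keys with 3 unique values; the output needs only 3 slots,
    // no matter how long the input is.
    int h_keys[] = {1, 2, 2, 2, 2, 3, 3};
    int h_vals[] = {1, 1, 1, 1, 1, 1, 1};
    thrust::device_vector<int> keys(h_keys, h_keys + 7);
    thrust::device_vector<int> vals(h_vals, h_vals + 7);

    const int max_spans = 3;  // known upper bound for this input
    thrust::device_vector<int> out_keys(max_spans);
    thrust::device_vector<int> out_vals(max_spans);

    // Returns a pair of end iterators for the two output ranges.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());

    int n = ends.first - out_keys.begin();  // number of reductions written
    for (int i = 0; i < n; ++i)
        std::cout << out_keys[i] << ": " << out_vals[i] << "\n";
}
```

In your case (100-150 unique keys, sorted input), the two output vectors need only ~150 elements rather than 16 million.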


Examples:

Non-sorted (common case)

Output size is hard to predict

keys        2   2   2   3   3   2   1
values      1   1   1   1   1   1   1
           |----------|------|----|---|
                    4 spans

reduced         3       2       1   1

Sorted (best case)

Output size is equal to number of unique keys

keys        1   2   2   2   2   3   3
values      1   1   1   1   1   1   1
           |--|---------------|------|
                    3 spans

reduced     1           4         2

Interleaved, no adjacent keys repeated (worst case)

Output size is equal to input size

keys        1   2   3   1   2   3   1
values      1   1   1   1   1   1   1
           |--|---|---|---|---|---|--|
                    7 spans

reduced     1   1   1   1   1   1   1
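If you don't know the span count up front, you can compute it in a cheap pass before allocating. A sketch of one common trick: count positions where adjacent keys differ with thrust::inner_product, which gives exactly the number of spans (7 for the worst-case input above).

```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    // The interleaved worst-case input: every adjacent pair differs.
    int h_keys[] = {1, 2, 3, 1, 2, 3, 1};
    thrust::device_vector<int> keys(h_keys, h_keys + 7);

    // Number of spans = 1 + number of adjacent positions where the key changes.
    int num_spans = thrust::inner_product(keys.begin(), keys.end() - 1,
                                          keys.begin() + 1,
                                          1,
                                          thrust::plus<int>(),
                                          thrust::not_equal_to<int>());
    std::cout << num_spans << "\n";  // 7 for this input
}
```

You can then allocate the output vectors with exactly num_spans elements.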

One more thing: if you don't need the keys in the output, you can discard them with thrust::discard_iterator and save some extra space, as described here.
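A sketch of that variant: passing thrust::make_discard_iterator() as the keys output means no memory is allocated or written for the reduced keys at all.

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

int main() {
    int h_keys[] = {1, 2, 2, 2, 2, 3, 3};
    int h_vals[] = {1, 1, 1, 1, 1, 1, 1};
    thrust::device_vector<int> keys(h_keys, h_keys + 7);
    thrust::device_vector<int> vals(h_vals, h_vals + 7);

    thrust::device_vector<int> out_vals(3);

    // Reduced keys are thrown away; only the reduced values are stored.
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      thrust::make_discard_iterator(),
                                      out_vals.begin());

    int n = ends.second - out_vals.begin();
    for (int i = 0; i < n; ++i)
        std::cout << out_vals[i] << "\n";
}
```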

Upvotes: 5
