Reputation: 103
I'm fairly new to CUDA programming, so please forgive me if this is a silly question.
In CUDA, I'm trying to populate a small device array B (~20000 int
elements) with the contents of a large device array A (~20 million int
elements). A contains mostly zeros, but has roughly 20,000 non-zero elements, located at random and unknown positions in the array. I'd like to fill B with the non-zero contents of A using CUDA. The order of the elements within B is not important.
I've looked at the SDK and found a number of "reduce" strategies, e.g., for parallel summing of an array, but each of these approaches reduces the array to a scalar, whereas I'm trying to "reduce" an array to a smaller array. Searching online hasn't yielded anything either. I'm not looking for full code, just some ideas/links on how to implement this. I'm using C, and if possible, I'd like to do this without using any C++ classes or structures.
Thank you in advance for your assistance.
Upvotes: 0
Views: 216
Reputation: 152113
What you're describing sometimes goes by the name *stream compaction*.
Thrust (e.g. `copy_if`) and CUB (e.g. `DeviceSelect`) offer options that should have relatively good performance.
If you did want to implement it yourself, stream compaction is typically built from a sequence of lower-level parallel operations, a key one being a prefix sum (scan). You can get an idea of the build-up of a simple parallel prefix sum (and stream compaction) in GPU Gems. I'm just adding this for informational purposes; I'm not suggesting you implement either stream compaction or a prefix sum yourself.
A complete stream compaction example using the GPU Gems prefix sum method is here. However, I strongly encourage anyone to use Thrust or CUB instead.
Upvotes: 2