Dariusz
Dariusz

Reputation: 11

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.

I am trying to parallelize algorithm which needs to sort, calculate average, standard deviation and skewness for each vector.

In the next step, the algorithm has to delete the maximal element and repeated calculation of statistical moments until some criteria is not fulfilled for each vector independently.

I would like to ask someone more experienced what is the best approach to parallelize this algorithm.

Is it possible to sort more that one vector at once?

Maybe is it better to not parallelize sorting but the whole algorithm as one thread?

Upvotes: 1

Views: 375

Answers (1)

einpoklum
einpoklum

Reputation: 131546

200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.

2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vector = 200,000 blocks; but you can't have 2000 block threads - that excessive

You might be able to use cub's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own - although that's more challenging to get right.

Anyway, if you write your own kernel, you can code your "next step" after sorting the data.

Maybe is it better to not parallelize sorting but the whole algorithm as one thread?

This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.

Upvotes: 2

Related Questions