Lior

Reputation: 2019

How to decide how to parallelize nested loops in a GPU?

Suppose I have an algorithm which I want to implement on a GPU. The algorithm consists of a main loop, and all iterations of the loop can be run in parallel. Also, each iteration of the loop has an inner loop whose iterations can be run in parallel. Let's say that I need N iterations of the main loop and M iterations of the inner loop (per main-loop iteration), and that my GPU has L cores.

If N + N*M <= L, I can run everything in parallel. But if this is not the case, I need to decide what to run sequentially. How should I make this decision? For example, if N=10, M=5, L=20, when should I choose each of the following options (or any others)?

  1. Run all main iterations in parallel, and all inner loops sequentially.
  2. Run all main iterations sequentially, and all inner loops in parallel.
  3. Run all main iterations in parallel; within each, run two inner-loop iterations in parallel and the rest sequentially.
  4. Run three of the main iterations in parallel, each with its inner loop fully parallel; run the remaining main iterations and their inner loops sequentially.
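For concreteness, option 1 maps naturally onto a kernel with one thread per main-loop iteration. A minimal CUDA sketch, where `inner_step` is a hypothetical placeholder for the real inner-loop body (not from the question):

```cuda
#include <cuda_runtime.h>

// Placeholder for the actual work done in inner-loop iteration j of
// main-loop iteration i.
__device__ float inner_step(int i, int j) { return (float)(i + j); }

// Option 1: one thread per main-loop iteration; each thread walks its
// M inner iterations sequentially.
__global__ void option1_kernel(float *out, int N, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    for (int j = 0; j < M; ++j)
        out[i * M + j] = inner_step(i, j);
}

// Launch example: enough 256-thread blocks to cover all N iterations.
// option1_kernel<<<(N + 255) / 256, 256>>>(d_out, N, M);
```

Options 2–4 correspond to different assignments of the (i, j) iteration space to threads; the kernel body stays similar while the indexing changes.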

Upvotes: 2

Views: 1275

Answers (1)

einpoklum

Reputation: 131425

You shouldn't care whether everything can actually run in parallel or not. When writing a GPU kernel for an embarrassingly-parallel problem like the one you describe, you could simply use a two-dimensional N x M grid, where each element is a thread performing the j'th inner-loop iteration of the i'th main-loop iteration.
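A sketch of that flat N x M grid in CUDA, with `compute` standing in for the actual iteration body (a hypothetical name, not from the question). The GPU scheduler handles oversubscription for you, so the grid can be much larger than the core count:

```cuda
#include <cuda_runtime.h>

// Placeholder for the work of inner iteration j within main iteration i.
__device__ float compute(int i, int j) { return (float)(i * j); }

// One thread per (i, j) pair in the N x M iteration space.
__global__ void grid2d_kernel(float *out, int N, int M)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // inner-loop index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // main-loop index
    if (i < N && j < M)
        out[i * M + j] = compute(i, j);
}

// Launch with enough blocks to cover the whole domain; the hardware
// time-slices warps when there are more threads than cores.
// dim3 block(16, 16);
// dim3 grid((M + block.x - 1) / block.x, (N + block.y - 1) / block.y);
// grid2d_kernel<<<grid, block>>>(d_out, N, M);
```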

However... there are often considerations which make it worthwhile to do things differently. For example, you might be able to unroll the inner loop if M is not too large; or you might have code which must run after all M iterations of the inner loop, and synchronizing the threads may not be worthwhile (seeing as you typically saturate your GPU's parallelism with N >> 1 alone). Also, memory access patterns play a very significant role in deciding what to run in parallel (see, for example, this presentation).
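To illustrate the unrolling point: if M is a small compile-time constant, the one-thread-per-main-iteration kernel can have its inner loop unrolled entirely. A sketch with a hypothetical reduction body:

```cuda
#include <cuda_runtime.h>

// M is a compile-time constant here, so the compiler can fully unroll
// the inner loop into straight-line code (hypothetical summation body).
template <int M>
__global__ void unrolled_kernel(float *out, const float *in, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float acc = 0.0f;
    #pragma unroll
    for (int j = 0; j < M; ++j)
        acc += in[i * M + j];  // becomes M unrolled additions
    out[i] = acc;
}

// Launch example for M = 5:
// unrolled_kernel<5><<<(N + 255) / 256, 256>>>(d_out, d_in, N);
```

Unrolling removes loop overhead and exposes instruction-level parallelism within each thread, which is often cheaper than spawning M extra threads plus a synchronization step.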

So, there isn't really a general answer. Or, perhaps, the answer would be:

  1. Implement something you feel might be a good idea.
  2. Profile it.
  3. See whether you're effectively utilizing the GPU's resources.
  4. Alter your implementation approach accordingly.
  5. Repeat.

(as suggested in another relevant presentation, and apologies for this answer being a bit vague and broad.)

Upvotes: 2
