mchen

Reputation: 10156

Will performance be hit if a kernel is too short?

If I'm doing an element-by-element operation on a matrix M, say M[i, j] *= (1 - M[i, j]), is it fine to launch a thread for each element (i, j)? I'm just concerned about the point at which the overhead of launching threads outweighs the parallelism achieved.

Upvotes: 0

Views: 109

Answers (1)

alrikai

Reputation: 4194

It's often better to do more work per thread where possible, with the goal of exposing instruction-level parallelism. If a given thread executes multiple independent operations, those instructions can be pipelined and executed without stalls, which increases your arithmetic throughput. In contrast, if each thread does one piece of (trivial) work, there's no opportunity for instruction-level parallelism within the thread and no opportunity for it to hide any of its memory latency.
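As a rough sketch of what "more work per thread" could look like for the M[i, j] *= (1 - M[i, j]) case: the kernel below treats the matrix as a flat array and has each thread load several independent elements before computing on them. The kernel name `logistic_step`, the constant `ITEMS`, and the launch configuration are all made up for illustration, and whether this actually beats one thread per element depends on the card, so profile both.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tuning knob: independent elements handled per thread per pass.
constexpr int ITEMS = 4;

// Hypothetical kernel: m[idx] = m[idx] * (1 - m[idx]) over a flattened matrix.
__global__ void logistic_step(float* m, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int base = tid; base < n; base += stride * ITEMS) {
        float v[ITEMS];

        // Issue the independent loads first so they can overlap in flight.
        #pragma unroll
        for (int k = 0; k < ITEMS; ++k) {
            int idx = base + k * stride;   // neighbouring threads touch neighbouring addresses
            v[k] = (idx < n) ? m[idx] : 0.0f;
        }

        // Then do the independent arithmetic and stores.
        #pragma unroll
        for (int k = 0; k < ITEMS; ++k) {
            int idx = base + k * stride;
            if (idx < n) m[idx] = v[k] * (1.0f - v[k]);
        }
    }
}

int main()
{
    const int rows = 1000, cols = 1000, n = rows * cols;
    std::vector<float> host(n, 0.25f);

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Each thread covers ITEMS elements, so launch fewer threads than elements.
    const int threads = 256;
    const int blocks  = (n + threads * ITEMS - 1) / (threads * ITEMS);
    logistic_step<<<blocks, threads>>>(dev, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // 0.25 * (1 - 0.25) = 0.1875 for every element.
    for (int i = 0; i < n; ++i) {
        if (std::fabs(host[i] - 0.1875f) > 1e-6f) {
            std::printf("mismatch at %d: %f\n", i, host[i]);
            return 1;
        }
    }
    std::printf("all %d elements match\n", n);
    return 0;
}
```

Setting `ITEMS` to 1 gives the plain one-element-per-thread version, which makes it easy to time the two against each other.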

Also, there's a finite number of registers available, so the more threads you launch with, the fewer registers are available per thread. I'm not sure about Kepler cards, but back in the Fermi generation, registers had roughly 8x the bandwidth of shared memory, so using registers when possible was important (again, I don't have a Kepler card, so I don't know whether this has since changed).
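If you want to see how the register budget plays out on your own card, the CUDA runtime can report it. This is only a sketch: the kernel `logistic_step_simple` and the block size of 256 are placeholders I picked, not something from the question. It uses `cudaFuncGetAttributes` to read the registers per thread the compiler assigned, and `cudaOccupancyMaxActiveBlocksPerMultiprocessor` to see how many blocks can be resident on one multiprocessor at that block size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: one element per thread.
__global__ void logistic_step_simple(float* m, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = m[idx];
        m[idx] = v * (1.0f - v);
    }
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    // Registers per thread chosen by the compiler for this kernel.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, logistic_step_simple) != cudaSuccess) return 1;

    // Resident blocks per multiprocessor at an assumed block size of 256.
    const int blockSize = 256;
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, logistic_step_simple, blockSize, 0) != cudaSuccess) return 1;

    std::printf("registers per multiprocessor: %d\n", prop.regsPerMultiprocessor);
    std::printf("registers per thread:         %d\n", attr.numRegs);
    std::printf("resident threads per SM:      %d\n", blocksPerSM * blockSize);
    return 0;
}
```

Compiling with `nvcc --ptxas-options=-v` also prints the per-kernel register count at build time.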

Although it's a bit dated, the recommendations detailed here are still very relevant

Upvotes: 1
