Reputation: 10258
We are experimenting with a CUDA CFD code, in which the 3D field is decomposed into 1D or 2D thread blocks (horizontal slices from the 3D field). In many cases, all threads within one block share a recurring calculation which only depends on height, i.e. it is identical for all threads within a thread block. What would be the best way to optimize such a situation (especially if the recurring calculation is the most expensive one in the entire kernel)? Is it somehow possible to let one thread do the calculation, and share the outcome with the rest of the threads?
With my limited CUDA knowledge, the best I can come up with is something like: if threadIdx.x and threadIdx.y are both 0, calculate the recurring value and put it into shared memory (?); all threads then use this value from shared memory in their calculations. Would that be efficient, or are there better ways?
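A minimal sketch of the idea described above, under some assumptions that are not in the original code: each 2D block covers a tile of one horizontal slice, blockIdx.z selects the height level, and expensive_height_term is a hypothetical stand-in for the real height-only calculation. The names apply_height_term and run_demo are likewise made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive calculation that depends only on height k.
__device__ float expensive_height_term(int k)
{
    return expf(-0.1f * k) * sqrtf(1.0f + k);
}

// Assumed layout: field[(k * ny + j) * nx + i], one height level per blockIdx.z.
__global__ void apply_height_term(float* field, int nx, int ny)
{
    __shared__ float term;  // one value per block, visible to all its threads

    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    const int k = blockIdx.z;

    // Only the first thread of the block evaluates the height-only term.
    if (threadIdx.x == 0 && threadIdx.y == 0)
        term = expensive_height_term(k);

    // Every thread must reach the barrier, so it comes before the bounds check.
    __syncthreads();

    if (i < nx && j < ny)
        field[(static_cast<size_t>(k) * ny + j) * nx + i] *= term;
}

// Host-side check: after the kernel, all cells of a level should hold the same value.
bool run_demo(int nx, int ny, int nz)
{
    const size_t n = static_cast<size_t>(nx) * ny * nz;
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y, nz);
    apply_height_term<<<grid, block>>>(dev, nx, ny);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    const size_t slice = static_cast<size_t>(nx) * ny;
    for (int k = 0; ok && k < nz; ++k)
        for (size_t c = 0; c < slice; ++c)
            if (host[k * slice + c] != host[k * slice]) { ok = false; break; }
    return ok;
}
```

Calling run_demo(40, 24, 8) from a host main would exercise the kernel on a small field. Whether this beats letting every thread recompute the term depends on how costly the real calculation is compared with one shared-memory write plus a __syncthreads() per block.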
Upvotes: 1
Views: 195
Reputation: 1781
There are a number of ways to do what you want to do, but without more information it's impossible to say which will be best.
Some options, ordered roughly from "the expensive calculation is not very complex" to "the expensive calculation is extremely complex":
It all comes down to what your expensive calculation actually is. Ideally you'd have provided a bit more information in the question!
Upvotes: 1