Bart

Reputation: 10258

cuda: shared 'constants' amongst thread block

We are experimenting with a CUDA CFD code, in which the 3D field is decomposed into 1D or 2D thread blocks (horizontal slices from the 3D field). In many cases, all threads within one block share a recurring calculation which depends only on height, i.e. it is identical for all threads within a thread block. What would be the best way to optimize such a situation (especially if the recurring calculation is the most expensive one in the entire kernel)? Is it somehow possible to let one thread do the calculation, and share the outcome with the rest of the threads?

With my limited CUDA knowledge, the best I can come up with is something like: if threadIdx.x and threadIdx.y are both 0, calculate the recurring value and put it into shared memory; all threads then read this value from shared memory in their calculations. Would that be efficient, or are there better ways?
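To make the idea concrete, here is a minimal sketch of that approach. The field layout, the mapping of blockIdx to a height level, and expensive_height_term() are all placeholders standing in for the real CFD code:

```cuda
#include <math.h>

// Placeholder for the expensive, height-only recurring calculation.
__device__ double expensive_height_term(int k)
{
    return exp(-0.1 * k) * sqrt((double)(k + 1));
}

__global__ void kernel(double *field, int nx, int ny)
{
    __shared__ double height_term;      // one copy per block

    int k = blockIdx.z;                 // block index selects the height level
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // One thread computes the recurring value for this height level...
    if (threadIdx.x == 0 && threadIdx.y == 0)
        height_term = expensive_height_term(k);

    // ...and a barrier makes it visible to the rest of the block.
    __syncthreads();

    if (i < nx && j < ny)
        field[(k * ny + j) * nx + i] += height_term;
}
```

Note the __syncthreads() between the write and the reads: without it, threads other than (0,0) may read the shared variable before it has been set.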

Upvotes: 1

Views: 195

Answers (1)

Jez

Reputation: 1781

There are a number of ways to do what you want to do, but without more information it's impossible to say which will be best.

Some options, ordered roughly from "the expensive calculation is not very complex" to "the expensive calculation is extremely complex":

  1. Do nothing. Just calculate it on every thread. If the calculation isn't very complicated this may be fastest, as it avoids synchronisation and leaves the compiler free to optimise.
  2. Do as you are thinking. Calculate on one thread, store in shared memory, synchronise, then read that value from shared memory. The calculation on one thread will be inefficient and latency bound, so you'd probably want quite a few blocks per multiprocessor to hide this latency.
  3. Parallelise the complicated calculation. Maybe there is some parallelism within this complex calculation that you can exploit. For example, if you needed to sum over a small number of values you might choose a parallel sum instead of a sequential one.
  4. Precompute. Depending on the scenario it may be better to precompute the expensive data in a separate kernel. This option is probably only good if the expensive calculation is very expensive. This may be faster as the expensive part of the kernel may be using a lot of registers, which may be limiting the occupancy of the rest of the kernel.

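As an illustration of option 3, suppose the expensive part were a sum of a small number of height-dependent contributions. The block's threads can share that work with a standard shared-memory tree reduction. This is a hedged sketch: term(), N_TERMS, and the field layout are all invented for illustration, not taken from the question's code:

```cuda
#define N_TERMS 64   // assumed blockDim.x >= N_TERMS

// Placeholder for one contribution to the height-dependent sum.
__device__ double term(int k, int n)
{
    return 1.0 / ((double)(n + 1) * (k + 1));
}

__global__ void kernel(double *field, int nx)
{
    __shared__ double partial[N_TERMS];
    int k = blockIdx.y;                 // height level handled by this block
    int t = threadIdx.x;

    // Each of the first N_TERMS threads computes one contribution.
    if (t < N_TERMS)
        partial[t] = term(k, t);
    __syncthreads();

    // Tree reduction: log2(N_TERMS) steps instead of a sequential loop.
    for (int s = N_TERMS / 2; s > 0; s >>= 1) {
        if (t < s)
            partial[t] += partial[t + s];
        __syncthreads();
    }
    double height_sum = partial[0];     // now available to every thread

    int i = blockIdx.x * blockDim.x + t;
    if (i < nx)
        field[k * nx + i] += height_sum;
}
```

The __syncthreads() calls sit outside the `if (t < s)` guard so that every thread in the block reaches each barrier, which is required for correctness.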
It all comes down to what your expensive calculation actually is. Ideally you'd have provided a bit more information in the question!

Upvotes: 1
