user4811

Reputation: 183

Do complex thread index calculations have an impact on the performance?

I am wondering whether complex index calculations involving, e.g., threadIdx.x affect performance. Do these variables become constants once the kernel is uploaded to the device?

I want to index into a huge array, where the index depends on threadIdx.x, threadIdx.y and threadIdx.z. I need, for example, modulo operations like

array[threadIdx.y % 2 + ...]

Upvotes: 1

Views: 173

Answers (3)

user4811

Reputation: 183

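In case anyone is interested, I have inspected the corresponding PTX code.

(1) Complex thread ID calculations do have an impact on performance. "threadIdx.x" etc. are not compile-time constants; in PTX they are read from special registers (%tid.x and so on) at run time, so any arithmetic on them is executed by every thread.

(2) "threadIdx.y % 2" is implemented efficiently: it is compiled to the equivalent of "threadIdx.y & 0x00000001" (CUDA Toolkit 5.5).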
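For anyone who wants to repeat this check, here is a minimal sketch of a kernel that could be compiled to PTX. The names parity_slot and parity_index, the file name and the array layout are made up for illustration and are not the code from the question. With a setup like this I would expect the modulo to show up as an and instruction on a value read from %tid.y, but the exact output depends on the toolkit version and target architecture.

```cuda
// Compile to PTX for inspection (file name is hypothetical):
//   nvcc -ptx parity_index.cu -o parity_index.ptx

// Hypothetical helper mirroring the index expression from the question.
// Both operands are unsigned, so "% 2" can be lowered to a bitwise AND.
__host__ __device__ inline unsigned int parity_slot(unsigned int x, unsigned int y)
{
    return y % 2 + 2 * x;
}

// Hypothetical kernel: "in" is assumed to hold at least 2 * blockDim.x floats,
// "out" at least blockDim.x * blockDim.y floats. Launch with a single block.
__global__ void parity_index(const float* in, float* out)
{
    unsigned int idx = parity_slot(threadIdx.x, threadIdx.y);
    out[threadIdx.y * blockDim.x + threadIdx.x] = in[idx];
}
```
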

Upvotes: 0

Vitality

Reputation: 21515

I assume that

array[threadIdx.y % 2 + ...]

is just an example.

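Generally speaking, % operations can be slow. A useful trick to speed up index calculations is noting that

foo % n == (foo & (n - 1)) if n is a power of 2 and foo is non-negative

The parentheses matter, because == binds more tightly than & in C/C++. For the example above the compiler will probably make this optimization for you, but if you have foo % n where n is only known at run time (and is guaranteed to be a power of 2), the trick is worth applying by hand.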
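As a rough sketch of how the trick could be used when n is a run-time value: the helper mod_pow2 and the kernel wrap_index below are hypothetical names, and the code assumes the caller guarantees that n is a power of two (the helper does not check this).

```cuda
// Hypothetical helper: remainder of foo divided by n.
// Only valid when n is a power of two; unsigned operands avoid the
// sign issues of % with negative values.
__host__ __device__ inline unsigned int mod_pow2(unsigned int foo, unsigned int n)
{
    return foo & (n - 1u);
}

// Hypothetical kernel: wraps the global thread index into an input array
// of n floats. Assumes n is a power of two and "out" has one float per thread.
__global__ void wrap_index(const float* in, float* out, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[mod_pow2(tid, n)];
}
```

For signed operands the identity does not hold in general: in C++ -3 % 2 is -1, while -3 & 1 is 1.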

Upvotes: 1

a.lasram

Reputation: 4411

You have an addition and a modulus in your index computation.

From the CUDA programming guide: the throughput of operator+ is very high (160 operations per clock cycle per multiprocessor on a GPU of compute capability 3.5).

operator% is much more expensive: it compiles to tens of instructions, each with a throughput similar to that of operator+.

In your case you are using operator% with a literal constant, so the compiler will very likely replace it with something cheaper. Moreover, your constant is a power of two (2), so the compiler will replace the modulo with the bitwise operator& (same throughput as operator+).

It is important to profile your application first, so that you do not waste time optimizing arithmetic operations without gaining any performance. Arithmetic operations are often completely hidden behind memory loads and stores, in which case you should focus on optimizing memory throughput instead.
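If a full profiler run is not at hand, a quick first measurement can be made with CUDA events. The sketch below is only an illustration: gather_mod and time_gather_mod are hypothetical names, the kernel is a stand-in for your own, and a single timed launch is a rough number, not a substitute for a profiler.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: index computed with a modulo by a run-time value n.
__global__ void gather_mod(const float* in, float* out, unsigned int n)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid % n];
}

// Times one launch of gather_mod with CUDA events.
// Returns the elapsed time in milliseconds, or -1.0f if a CUDA call fails
// (for example when no device is available).
float time_gather_mod(unsigned int blocks, unsigned int threads, unsigned int n)
{
    size_t count = (size_t)blocks * threads;
    float* in = 0;
    float* out = 0;
    float ms = -1.0f;

    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, count * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    gather_mod<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

Timing the same kernel once with tid % n and once with a cheaper index expression shows whether the arithmetic matters at all; if the two times are about equal, the kernel is most likely bound by memory traffic.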

Upvotes: 3
