phimuemue

Reputation: 35993

How do GPUs (Geforce 9800) implement bitwise integer operations?

CUDA lets the programmer write expressions such as a & b | ~c (a, b, c being unsigned ints).

What does the GPU do internally? Does it somehow "emulate" bitwise operations on integers, or are they similarly efficient to those on a traditional CPU?

Upvotes: 2

Views: 3862

Answers (1)

wnbell

Reputation: 1240

According to the CUDA Programming Guide v2.3 (Section 5.1.1.1) the bitwise operations run at full speed (8 operations per clock cycle).

Integer Arithmetic

Throughput of integer add is 8 operations per clock cycle.

Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but mul24 provides 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures, however, mul24 will be slower than 32-bit integer multiplication, so we recommend providing two kernels, one using mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.

Integer division and the modulo operation are particularly costly and should be avoided or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is a literal.

Comparison: Throughput of compare, min, max is 8 operations per clock cycle.

Bitwise Operations: Throughput of any bitwise operation is 8 operations per clock cycle.

Upvotes: 5
