Reputation: 35993
CUDA gives the programmer the possibility to write something like a & b | ~c (a, b, c being unsigned ints).
What does the GPU do internally? Does it somehow "emulate" bitwise operations on integers, or are they about as efficient as on a traditional CPU?
Upvotes: 2
Views: 3862
Reputation: 1240
According to the CUDA Programming Guide v2.3 (Section 5.1.1.1), bitwise operations run at full speed (8 operations per clock cycle). The relevant excerpt:
Integer Arithmetic
Throughput of integer add is 8 operations per clock cycle.
Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but mul24 provides 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures, however, mul24 will be slower than 32-bit integer multiplication, so we recommend providing two kernels, one using mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.
Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
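To make the power-of-two rewrite concrete, here is a minimal sketch (not from the guide). The names div_pow2, mod_pow2, compare_divmod and count_mismatches are made up for illustration, and it assumes unsigned operands and that n equals 1u << log2n. For negative signed values the shift and the division round differently, so the equivalence is only claimed here for unsigned ints.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helpers: shift/mask forms of i / n and i % n,
// valid when n == 1u << log2n and i is unsigned.
__host__ __device__ inline unsigned int div_pow2(unsigned int i, unsigned int log2n)
{
    return i >> log2n;
}

__host__ __device__ inline unsigned int mod_pow2(unsigned int i, unsigned int log2n)
{
    return i & ((1u << log2n) - 1u);
}

// Each thread compares the plain division/modulo against the shift/mask
// form and bumps a global counter when they disagree.
__global__ void compare_divmod(const unsigned int *in, unsigned int *mismatch,
                               unsigned int log2n, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;

    unsigned int n = 1u << log2n;  // runtime value, so the compiler cannot pre-convert it
    unsigned int i = in[idx];

    if (i / n != div_pow2(i, log2n) || i % n != mod_pow2(i, log2n))
        atomicAdd(mismatch, 1u);
}

// Host helper (hypothetical): returns the number of disagreeing elements,
// or -1 if any CUDA call failed.
int count_mismatches(const unsigned int *host_in, int count, unsigned int log2n)
{
    assert(count > 0);
    assert(log2n < 32u);  // shifting a 32-bit value by 32 or more is undefined

    size_t bytes = count * sizeof(unsigned int);
    unsigned int *d_in = nullptr, *d_mismatch = nullptr;
    unsigned int mismatches = 0;

    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_mismatch, sizeof(unsigned int)) == cudaSuccess &&
              cudaMemcpy(d_in, host_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemset(d_mismatch, 0, sizeof(unsigned int)) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        compare_divmod<<<blocks, threads>>>(d_in, d_mismatch, log2n, count);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(&mismatches, d_mismatch, sizeof(unsigned int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_mismatch);
    return ok ? static_cast<int>(mismatches) : -1;
}
```

Calling count_mismatches(values, count, 3) on any unsigned input should report 0 mismatches, since i / 8 and i >> 3 (and i % 8 and i & 7) agree for unsigned i. In real code you would just call the shift/mask helpers when n is only known at run time but guaranteed to be a power of two.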
Comparison
Throughput of compare, min, max is 8 operations per clock cycle.
Bitwise Operations
Throughput of any bitwise operation is 8 operations per clock cycle.
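For reference, the expression from the question can be written as an ordinary kernel. This is only a sketch: combine, combine_kernel and combine_on_gpu are names invented here, not anything from the guide. Note that C operator precedence makes a & b | ~c parse as (a & b) | (~c).

```cuda
#include <cassert>
#include <cuda_runtime.h>

// The expression from the question, with the implicit grouping spelled out.
// Marked __host__ __device__ so the same code can be checked on the CPU.
__host__ __device__ inline unsigned int combine(unsigned int a, unsigned int b,
                                                unsigned int c)
{
    return (a & b) | ~c;
}

// One thread per element: out[i] = a[i] & b[i] | ~c[i].
__global__ void combine_kernel(const unsigned int *a, const unsigned int *b,
                               const unsigned int *c, unsigned int *out, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count)
        out[idx] = combine(a[idx], b[idx], c[idx]);
}

// Host helper (hypothetical): copies the inputs to the device, runs the
// kernel, and copies the result back. Returns true if every CUDA call succeeded.
bool combine_on_gpu(const unsigned int *a, const unsigned int *b,
                    const unsigned int *c, unsigned int *out, int count)
{
    assert(count > 0);

    size_t bytes = count * sizeof(unsigned int);
    unsigned int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr, *d_out = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess &&
              cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(d_c, c, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        combine_kernel<<<blocks, threads>>>(d_a, d_b, d_c, d_out, count);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFree(d_out);
    return ok;
}
```

Going by the throughput figures quoted above, each of the three operations in combine should cost about the same as an integer add, so there is no reason to avoid this kind of code on the GPU. If you want to confirm what the compiler actually emits for your target, inspecting the generated PTX (for example with nvcc -ptx) is the usual way to check.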
Upvotes: 5