Reputation: 6452
It's an old trick that a positive integer is quick to double using bitshifts:
integer_variable << 1 == 2 * integer_variable
Of course, that trick relies on the bit representation of (non-negative) integers and ignores overflow.
Question: is there a similar trick for fast doubling of floating-point numbers? I presume that multiplying by 2.0 uses generic multiplication circuitry, even though doubling could be performed much faster with a specialized hardware instruction or by hacking the bit representation.
For example, in the case of normalized floating-point numbers, one could possibly exploit the bit representation as well and manipulate the exponent. Are there any known hacks like that which provide a practical benefit?
Upvotes: 0
Views: 104
Reputation: 106
It's an old trick that a positive integer is quick to double using bitshifts:
integer_variable << 1 == 2 * integer_variable
If we're talking about C (or C++, or Rust, etc.), modern compilers will treat these as equivalent, and may use LEA on amd64 to sum two registers, or LSL on aarch64 to shift left by one, no matter how you phrase it in C. They may even merge the doubling into a complex addressing mode, compiling p[2*x] into a single mov eax, DWORD PTR [rdi+rsi*8] instruction on amd64 (Godbolt).
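For instance, here is a minimal sketch (the function names are mine, not from the question or answer) that you can paste into Godbolt to confirm a modern compiler emits the same code for both spellings:
#include <stddef.h>

long dbl_shift(long x) { return x << 1; }   /* same code as dbl_mult below */
long dbl_mult(long x)  { return 2 * x; }    /* typically a single lea or add */

int load_doubled(const int *p, size_t x) {
    return p[2 * x];   /* folded into one scaled-index load */
}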
But maybe you're asking about choosing CPU instructions, like amd64 SHL vs. IMUL, where SHL is likely to be cheaper: Agner Fog's instruction tables certainly show a higher latency and reciprocal throughput for IMUL than for SHL on many microarchitectures.
Question: is there a similar trick for fast doubling of floating-point numbers? I presume that multiplying by 2.0 uses generic multiplication circuitry, even though doubling could be performed much faster with a specialized hardware instruction or hacking the bit representation.
For example, in the case of normalized floating-point numbers, one could possibly exploit the bit representation as well and manipulate the exponent. Are there any known hacks like that which provide a practical benefit?
You can manipulate the bit representation: add to the exponent bits.
Here are a couple of ways you might do it on amd64, with double as IEEE 754 binary64, alongside plain multiplication (dbl_mul) for comparison:
#include <stdint.h>

double dbl_mul(double x) {
    return 2*x;
}

double dbl_union(double x) {
    /* Reinterpret the bits and add 1 to the biased exponent field:
       adding 2^52 to the integer representation doubles a normal double. */
    union { uint64_t i; double d; } u = { .d = x };
    u.i += 0x0010000000000000;
    return u.d;
}

double dbl_asm(double x) {
    /* Same exponent bump, but done with a SIMD integer add so the
       value never leaves the xmm registers. */
    double y = x;
    uint64_t e = 0x0010000000000000;
    asm("paddq %[e],%[y]" : [y]"+x"(y) : [e]"x"(e));
    return y;
}
Let's look at the assembly that gcc 12.2 generates for this (Godbolt):
dbl_mul:
addsd xmm0, xmm0
ret
dbl_union:
movabs rax, 4503599627370496
movq rdx, xmm0
add rax, rdx
movq xmm0, rax
ret
dbl_asm:
movabs rax, 4503599627370496
movq xmm1, rax
paddq xmm1,xmm0
ret
What's the cost of dbl_mul?
On many microarchitectures, the CPU can execute multiple floating-point additions and multiplications in parallel; in Agner Fog's tables, this shows up as a reciprocal throughput of 0.5, e.g. on Skylake for ADDSD and MULSD (the ADDSS/D PS/D and MULSS/D PS/D entries). So although the latency of addsd xmm0, xmm0 itself may be four cycles (which on Skylake is the same as for mulsd!), you can get results from addsd xmm0, xmm0; addsd xmm1, xmm1 in an average of two cycles per add, even ignoring the parallelism you could have gotten with addpd in a single vector register.
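To make that last point concrete, here is a small sketch of my own (assuming SSE2 intrinsics; the function name is hypothetical) of doubling four binary64 values. The two packed adds are independent, so they can execute in parallel:
#include <emmintrin.h>

/* Double four adjacent binary64 values using two independent addpd
   operations, which can issue in the same cycle on such cores. */
void double4(double out[4], const double in[4]) {
    __m128d lo = _mm_loadu_pd(&in[0]);
    __m128d hi = _mm_loadu_pd(&in[2]);
    _mm_storeu_pd(&out[0], _mm_add_pd(lo, lo));
    _mm_storeu_pd(&out[2], _mm_add_pd(hi, hi));
}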
In contrast, for dbl_union, there is a penalty for moving between general-purpose and xmm registers (movq rdx, xmm0 and movq xmm0, rax), and because of the data dependencies this is serialized. So, going by Agner Fog's tables for Skylake, you have to wait for movabs rax, 0x0010000000000000; movq rdx, xmm0, then add rax, rdx, and then movq xmm0, rax, for a total of five cycles of latency (two for each movq between register files, plus one for the add), already more than addsd xmm0, xmm0.
Worse, there's no opportunity for parallelism here, either from the CPU's execution units or from explicit vectorization like addpd, which together can double four binary64 floating-point numbers in only four cycles, at five times the throughput of dbl_union's instruction sequence.
What about dbl_asm, where we use paddq on xmm registers instead of moving to general-purpose registers?
It turns out this hurts too, because even though we're not moving between register files, there is a domain-crossing penalty, or bypass delay, for mixing floating-point operations and integer operations in the same data path.
Intel recommends against doing this in their optimization guide (248966-045, February 2022):
6.1 General rules on SIMD integer code
[…]
When writing SIMD code that works for both integer and floating-point data, use the subset of SIMD convert instructions or load/store instructions to ensure that the input operands in XMM registers contain data types that are properly defined to match the instruction.
Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched SIMD data in the XMM register is strongly discouraged.
At least you can get vectorization out of this approach, but it's still unlikely to be competitive with just using addsd or mulsd.
Of course, all of this assumes the input is normal and the output doesn't overflow!
If you need to handle those cases, you have to pay for more arithmetic and/or branches.
Using a library function like ldexp(x, n), which returns x * 2^n, is tempting, but you will almost always incur function-call overhead and register save/restore cycles in that case, on top of the same expensive bit manipulation across domains, and so on.
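For completeness, a minimal sketch of the ldexp route mentioned above (standard C; the wrapper name is mine):
#include <math.h>

/* Doubles x, handling subnormal inputs and overflowing to infinity
   per IEEE 754 semantics, at the cost of a possible library call. */
double dbl_ldexp(double x) {
    return ldexp(x, 1);   /* x * 2^1 */
}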
In the end, you are almost certainly better off just writing 2*x in C, and if you need micro-optimization, focus on vectorization, not on bit-manipulation tricks.
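For example (a hypothetical loop of my own, just to illustrate the point), writing a plain 2 * src[i] in a simple loop is exactly the shape auto-vectorizers turn into packed adds or multiplies:
#include <stddef.h>

/* With optimization enabled, compilers vectorize this loop into
   addpd/vaddpd (or mulpd); no bit tricks required. */
void double_all(double *restrict dst, const double *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}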
Similar deal if you are tempted to use the infamous fast inverse square root hack, which is now totally obsolete given CPU instructions for approximate inverse square root like amd64 rsqrtss and aarch64 frsqrte/frsqrts.
All that said, I'm only hypothesizing based on a theoretical understanding of the microarchitectures—you still need to measure to find out for sure!
Upvotes: 1