shuhalo

Reputation: 6452

How to quickly double floating-point numbers?

It's an old trick that a positive integer is quick to double using bitshifts:

integer_variable << 1 == 2 * integer_variable

Of course, that trick relies on the bit representation of (non-negative) integers and ignores overflow.

Question: is there a similar trick for fast doubling of floating-point numbers? I presume that multiplying by 2. uses generic multiplication circuitry, even though doubling could be performed much faster with a specialized hardware instruction or hacking the bit representation.

For example, in the case of normalized floating-point numbers, one could possibly exploit the bit representation as well and manipulate the exponent. Are there any known hacks like that which provide a practical benefit?

Upvotes: 0

Views: 104

Answers (1)

Floating Pontiff

Reputation: 106

It's an old trick that a positive integer is quick to double using bitshifts:

integer_variable << 1 == 2 * integer_variable

If we're talking about C (or C++, or Rust, etc.), modern compilers will treat these as equivalent, and may use LEA on amd64 to sum two registers or LSL on aarch64 to shift left by one no matter how you phrase it in C. They may even merge them into complex addressing modes like compiling p[2*x] into a single mov eax, DWORD PTR [rdi+rsi*8] instruction on amd64 (Godbolt).
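For instance, here's a minimal sketch (the function names are mine) that you can paste into Godbolt to check that, at -O2, gcc and clang compile both spellings the same way:

#include <stddef.h>

/* Both forms compile to the same code at -O2, typically a single
   lea or shl on amd64. */
unsigned twice_mul(unsigned x)   { return 2 * x; }
unsigned twice_shift(unsigned x) { return x << 1; }

/* The scaling by 2*sizeof(int) folds into the addressing mode,
   e.g. mov eax, DWORD PTR [rdi+rsi*8]. */
int load_even(const int *p, size_t x) { return p[2 * x]; }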

But maybe you're asking about choosing CPU instructions like amd64 SHL vs IMUL, where SHL is likely to be cheaper—Agner Fog's instruction table certainly shows a higher latency and reciprocal throughput for IMUL than for SHL on many microarchitectures.

Question: is there a similar trick for fast doubling of floating-point numbers? I presume that multiplying by 2. uses generic multiplication circuitry, even though doubling could be performed much faster with a specialized hardware instruction or hacking the bit representation.

For example, in the case of normalized floating-point numbers, one could possibly exploit the bit representation as well and manipulate the exponent. Are there any known hacks like that which provide a practical benefit?

You can manipulate the bit representation by adding to the exponent bits: in IEEE 754 binary64, the biased exponent occupies bits 52–62, so adding 2^52 (0x0010000000000000) to the bit pattern increments the exponent by one, which doubles any normal number. Here are a couple of ways you might do that on amd64 with double as binary64:

#include <stdint.h>

/* The obvious way: let the compiler handle 2*x. */
double dbl_mul(double x) {
  return 2*x;
}

/* Type-pun through a union and add one unit to the biased exponent field. */
double dbl_union(double x) {
  union { uint64_t i; double d; } u = { .d = x };
  u.i += 0x0010000000000000;   /* 2^52, i.e. 1 in the exponent field */
  return u.d;
}

/* Same exponent trick, but keep the value in an xmm register and use a
   packed integer add instead of a round trip through general-purpose
   registers. */
double dbl_asm(double x) {
  double y = x;
  uint64_t e = 0x0010000000000000;
  asm("paddq %[e],%[y]" : [y]"+x"(y) : [e]"x"(e));
  return y;
}

Let's look at the assembly that gcc 12.2 generates for this (Godbolt):

dbl_mul:
        addsd   xmm0, xmm0
        ret
dbl_union:
        movabs  rax, 4503599627370496
        movq    rdx, xmm0
        add     rax, rdx
        movq    xmm0, rax
        ret
dbl_asm:
        movabs  rax, 4503599627370496
        movq    xmm1, rax
        paddq   xmm1,xmm0
        ret

What's the cost of dbl_mul? On many microarchitectures, the CPU can execute multiple floating-point additions and multiplications in parallel—in Agner Fog's tables, this shows as a reciprocal throughput of 0.5, e.g. on Skylake for ADDSD and MULSD (ADDSS/D PS/D and MULSS/D PS/D).

So although the latency of addsd xmm0, xmm0 itself may be four cycles (which on Skylake is the same as for mulsd!), you can get results from addsd xmm0, xmm0; addsd xmm1, xmm1 in an average of two cycles per add—even ignoring the parallelism you could have gotten with addpd in a single vector register.
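To see that parallelism pay off, here is a rough sketch of my own (not from the question) of a loop the compiler can vectorize; with gcc -O3 on amd64 it is likely to come out as packed adds over whole vectors rather than one addsd per element:

#include <stddef.h>

/* Doubling an array: with -O3 (or -O2 plus -ftree-vectorize) the compiler
   is likely to emit addpd/vaddpd here, exploiting the 0.5 reciprocal
   throughput mentioned above. */
void double_all(double *restrict dst, const double *restrict src, size_t n) {
  for (size_t i = 0; i < n; i++)
    dst[i] = 2 * src[i];
}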

In contrast, for dbl_union, there is a penalty for moving between xmm and general-purpose registers (movq rdx, xmm0 and movq xmm0, rax), and because of the data dependencies, the sequence is serialized. So, going by Agner Fog's tables on Skylake, you have to wait:

  • two cycles for movabs rax,0x0010000000000000; movq rdx, xmm0,
  • another cycle for add rax, rdx, and then
  • another two cycles for movq xmm0, rax,

for a total of five cycles of latency—already more than addsd xmm0, xmm0. Worse, there's no opportunity for parallelism here, either by the CPU's execution units or by explicit vectorization like addpd which together can double four binary64 floating-point numbers in only four cycles, at five times the throughput of dbl_union's instruction sequence.

What about dbl_asm, where we use paddq on xmm registers instead of moving to general-purpose registers? It turns out this hurts too, because even though we're not moving between registers, there is a domain-crossing penalty or bypass delay for mixing floating-point operations and integer operations in the same data path.

Intel recommends against doing this in their optimization guide (248966-045, February 2022):

6.1 General rules on SIMD integer code

[…]

When writing SIMD code that works for both integer and floating-point data, use the subset of SIMD convert instructions or load/store instructions to ensure that the input operands in XMM registers contain data types that are properly defined to match the instruction.

Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched SIMD data in the XMM register is strongly discouraged.

At least you can get vectorization out of this approach, but it's still unlikely to be competitive with just using addsd or mulsd.
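For instance, here is a sketch of that vectorized variant with SSE2 intrinsics (the function name is mine, and it inherits all the caveats above):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Doubles two binary64 values at once by adding one unit to each biased
   exponent with a packed 64-bit integer add. Still pays the bypass delay
   described above, and still assumes normal inputs that don't overflow. */
__m128d dbl2_exponent(__m128d x) {
  const __m128i one_exponent_unit = _mm_set1_epi64x(0x0010000000000000LL);
  return _mm_castsi128_pd(_mm_add_epi64(_mm_castpd_si128(x), one_exponent_unit));
}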


Of course, all of this assumes the input is normal and the output doesn't overflow! If you need to handle those cases you have to pay for more arithmetic and/or branches. Using a library function like ldexp(x,n), which returns x * 2^n, is tempting, but you will almost always incur a function call overhead and register save/restore cycles in that case, on top of the same expensive bit manipulation across domains and so on.
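For completeness, the ldexp route looks like this (the wrapper name is mine):

#include <math.h>

/* Handles subnormals, infinities, NaNs, and overflow correctly, at the
   price of a call into libm unless the compiler expands it inline. */
double dbl_ldexp(double x) {
  return ldexp(x, 1);   /* x * 2^1 */
}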

In the end, you are almost certainly better off just writing 2*x in C, and if you need micro-optimization, focus on vectorization, not on bit-manipulation tricks. The same goes if you are tempted to use the infamous fast inverse square root, which is now obsolete given dedicated CPU instructions for approximate reciprocal square roots like amd64 rsqrtss and aarch64 frsqrte/frsqrts.

All that said, I'm only hypothesizing based on a theoretical understanding of the microarchitectures—you still need to measure to find out for sure!
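If you want a quick first measurement, something like this sketch of mine (POSIX clock_gettime, with a dependent chain so latency dominates) is a place to start, though a serious comparison needs a proper benchmarking harness:

#include <stdio.h>
#include <time.h>

/* Time a dependent chain of doublings: each iteration waits for the
   previous result, so this approximates latency rather than throughput.
   Rescale periodically to keep the value normal. */
int main(void) {
  double x = 1.0;
  long iters = 100000000L;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < iters; i++) {
    x = 2 * x;
    if (x > 0x1p+1000) x *= 0x1p-1000;   /* stay in the normal range */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("x = %g, %.2f ns per doubling\n", x, ns / iters);
  return 0;
}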

Upvotes: 1
