Reputation: 4314
There is a well-known trick for performing division by an invariant integer without actually doing any division at all, using multiplication instead. This has been discussed on Stack Overflow in Perform integer division using multiplication and in Why does GCC use multiplication by a strange number in implementing integer division?
However, I recently tested the following code both on AMD64 and on ARM (Raspberry Pi 3 model B):
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    volatile uint64_t x = 123456789;
    volatile uint64_t y = 0;
    struct timeval tv1, tv2;
    int i;
    gettimeofday(&tv1, NULL);
    for (i = 0; i < 1000*1000*1000; i++)
    {
        y = (x + 999) / 1000;
    }
    gettimeofday(&tv2, NULL);
    printf("%g MPPS\n", 1e3 / ( tv2.tv_sec - tv1.tv_sec +
                                (tv2.tv_usec - tv1.tv_usec) / 1e6));
    return 0;
}
The code is horribly slow on ARM architectures. In contrast, on AMD64 it's extremely fast. I noticed that on ARM, it calls __aeabi_uldivmod, whereas on AMD64 it actually does not divide at all but instead does the following:
.L2:
movq (%rsp), %rdx
addq $999, %rdx
shrq $3, %rdx
movq %rdx, %rax
mulq %rsi
shrq $4, %rdx
subl $1, %ecx
movq %rdx, 8(%rsp)
jne .L2
The question is: why? Is there some particular feature of the ARM architecture that makes this optimization infeasible? Or is it just that ARM is a less common target, so optimizations like this haven't been implemented for it?
Before people start making suggestions in the comments: I tried both gcc and clang, and both the -O2 and -O3 optimization levels.
On my AMD64 laptop, it gives 1181.35 MPPS, whereas on the Raspberry Pi it gives 5.50628 MPPS. This is over 2 orders of magnitude difference!
Upvotes: 1
Views: 1459
Reputation: 363940
gcc only uses a multiplicative inverse for division of register-width or narrower operands. You're testing x86-64 against 32-bit ARM, so uint64_t gives x86-64 a huge advantage in this case.
Extended-precision multiply would probably be worth it on 32-bit CPUs with high-throughput multiply, like modern x86, and maybe also your Cortex-A7 ARM if it has a multiplier that's much better pipelined than its divider.
It would take multiple multiply instructions to get the high half of a 64b x 64b => 128b full-multiply result using only 32x32 => 64b multiplies as the building block (which ARM32 does have: umull).
However, this is not what gcc or clang choose to do, at any optimization level.
If you want to handicap your x86 CPU, compile 32-bit code using -m32. x86 gcc -O3 -m32 will use __udivdi3. I wouldn't call this "fair", though, because a 64-bit CPU is much faster at 64-bit arithmetic, and a Cortex-A7 doesn't have a 64-bit mode available.
OTOH, a 32-bit-only x86 CPU couldn't be much faster than current x86 CPUs in 32-bit mode; the main cost of the extra transistors that go unused in 32-bit mode is die area and power, not so much top-end clock speed. Some low-power-budget CPUs (like ULV laptop chips) might sustain max turbo for longer if they were designed without support for long mode (x86-64), but that's pretty minor.
So it might be interesting to benchmark 32-bit x86 vs. 32-bit ARM, just to learn something about the microarchitectures maybe. But if you care about 64-bit integer performance, definitely compile for x86-64 not x86-32.
Upvotes: 3