Reputation: 388
Consider the following code snippets in C++ (Visual Studio 2015):
First Block
const int size = 500000000;
int sum = 0;
int *num1 = new int[size]; // initialized between 1-250
int *num2 = new int[size]; // initialized between 1-250
for (int i = 0; i < size; i++)
{
    sum += num1[i] / num2[i];
}
Second Block
const int size = 500000000;
int sum = 0;
float *num1 = new float[size]; // initialized between 1-250
float *num2 = new float[size]; // initialized between 1-250
for (int i = 0; i < size; i++)
{
    sum += num1[i] / num2[i];
}
I expected the first block to run faster because it is an integer operation, but the second block is considerably faster even though it is a floating-point operation. Here are the results of my benchmark:
Division:
Type     Time
uint8    879.5 ms
uint16   885.284 ms
int      982.195 ms
float    654.654 ms
Similarly, floating-point multiplication is faster than (32-bit) integer multiplication. Here are the results of my benchmark:
Multiplication:
Type     Time
uint8    166.339 ms
uint16   524.045 ms
int      432.041 ms
float    402.109 ms
My system specs: CPU Core i7-7700, RAM 64 GB, Visual Studio 2015.
Upvotes: 18
Views: 12210
Reputation: 136515
Floating-point division is faster than integer division because of the exponent part of the floating-point representation: dividing the exponents requires only a relatively cheap, fixed-cost subtraction.
int32_t division requires division of 31-bit numbers, whereas float division requires division of 24-bit mantissas (the leading 1 of the mantissa is implied and not stored in the floating-point number) plus a much cheaper subtraction of the 8-bit exponents.
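For illustration, a small snippet that pulls apart the IEEE-754 single-precision fields shows the 8-bit exponent and the 23 stored mantissa bits (24 significant bits counting the implied leading 1):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = 200.0f / 7.0f;
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);      // safe type-pun

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;  // 8 bits, biased by 127
    uint32_t mantissa = bits & 0x7FFFFF;      // 23 stored bits
    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                sign, exponent, (int)exponent - 127, mantissa);
}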
The fast division computes its result iteratively. For any given operand bit-width it needs at most a fixed number of iterations, working from the low to the high significant bits of the numerator. When an intermediate denominator in that loop reaches 1, the loop can exit early, before that fixed upper bound on the iteration count is hit; hence "fast" division. This is why CPU division instructions have a latency range rather than a fixed latency.
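To make the "iterate until the intermediate denominator reaches 1" idea concrete, here is a minimal Goldschmidt-style division sketch in plain C++. It only illustrates the principle and is not what any particular CPU implements; the early-exit loop is what makes the iteration count, and therefore the latency, variable:

#include <cmath>
#include <cstdio>

// Goldschmidt division: repeatedly scale numerator and denominator by
// f = 2 - d so the denominator converges to 1; the numerator then
// converges to the quotient. It converges faster when d starts near 1.
double goldschmidt_div(double n, double d)
{
    int exp;
    d = std::frexp(d, &exp);   // normalize d into [0.5, 1), like a mantissa
    n = std::ldexp(n, -exp);   // compensate the numerator by the same power of 2

    int iterations = 0;
    while (std::fabs(1.0 - d) > 1e-15) {   // early exit once d is (almost) 1
        double f = 2.0 - d;
        n *= f;
        d *= f;
        ++iterations;
    }
    std::printf("converged in %d iterations\n", iterations);
    return n;
}

int main()
{
    std::printf("%.6f\n", goldschmidt_div(355.0, 113.0)); // ~3.141593
}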
See this excellent detailed explanation of how division is performed in a CPU.
Your benchmark code snippet is a plain load-modify-store loop, which is the ideal best-case scenario for compilers to vectorize with SIMD instructions. Keep in mind, then, that your time measurements don't extrapolate or generalize to more involved loops.
Upvotes: 36
Reputation: 365981
As Maxim pointed out, your code auto-vectorizes. x86 has SIMD FP division but no SIMD integer division.
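Roughly speaking (this is a hand-written sketch, not the compiler's actual output), the float loop can be vectorized with SSE like this, four divisions per divps; there is no packed integer divide instruction to do the same for the int loop. For simplicity the sketch accumulates into a float rather than the question's int:

#include <immintrin.h>
#include <cstddef>

float sum_of_quotients(const float *num1, const float *num2, size_t size)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 a = _mm_loadu_ps(num1 + i);
        __m128 b = _mm_loadu_ps(num2 + i);
        acc = _mm_add_ps(acc, _mm_div_ps(a, b));  // 4 quotients per divps
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                    // horizontal sum of the 4 lanes
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < size; ++i)                         // scalar tail
        sum += num1[i] / num2[i];
    return sum;
}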
CPU architects throw more transistors at FP division because it's often important: FP division is frequently an unavoidable part of numerical workloads.
But integer division is avoided, or at least avoidable, more often in computations that vectorize, and even in scalar code it is often rare enough that a slowish divider isn't a big bottleneck. (Division by constants is done with a high-half multiply and shifts, which are available in SIMD and are cheap for scalar integer. Without -ffast-math or its equivalent in other compilers / languages, FP division by most constants can only be replaced manually by a multiply by the reciprocal, since only round numbers like 2.0 have exactly representable inverses.)
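As an example of the high-half-multiply-and-shift trick, here is a sketch of unsigned division by the constant 7; the magic constant and the fixup step follow the classic invariant-division recipe, roughly what mainstream compilers emit for x / 7:

#include <cstdint>

// Unsigned 32-bit division by the constant 7 with no divide instruction.
// 0x24924925 == ceil(2^35 / 7) - 2^32; the fixup step supplies the
// "missing" 33rd multiplier bit without overflowing 32 bits.
static inline uint32_t div7(uint32_t x)
{
    uint32_t t = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32); // high half of x*magic
    uint32_t q = ((x - t) >> 1) + t;  // == (x + t) / 2, written to avoid overflow
    return q >> 2;                    // combined with the shifts above: divide by 2^35
}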
With modern transistor budgets being so big, scalar integer dividers do get some love, e.g. Intel beefing them up in Ice Lake, and Apple M1 achieving a remarkable 1/clock throughput for its integer divider. But as recently as Skylake, Intel left 64-bit integer division microcoded as over 50 uops (https://uops.info/).
So CPUs have big, powerful SIMD FP div/sqrt units vs. just a scalar integer divider. So much so that even double-precision FP division (53-bit mantissa) can be faster than int32_t (31-bit) division: one per 4 cycles for divsd / divpd xmm vs. one per 6 cycles for [i]div r32 on Skylake / Ice Lake / Alder Lake P-cores.
On Intel, scalar and 128-bit divpd can probably use the two 128-bit halves of the 256-bit SIMD divide unit separately, since their instruction throughput is twice that of vdivpd ymm.
Latency is pretty similar between idiv r32 and [v]divsd/pd xmm or ymm on Ice Lake and later, 10-15 cycles or 13-15 cycles respectively. Pre-Ice Lake, the integer division unit had higher latency but was still pipelined for the same 6c throughput.
TL;DR: FP division is fast because CPU architects throw a lot of hardware at it, enough to get a 53-bit mantissa divided about as fast as a 32-bit integer, and with better throughput (especially if you vectorize with SIMD: each divpd xmm instruction gets two divisions done, so the per-division throughput cost ratio is effectively 2:6, vs. 4:6 for scalar divsd against integer idiv).
Also, Intel makes FP division a single uop, while integer division is microcoded (especially before Ice Lake, which reduced the uop counts to 4 or fewer for most cases, down from 10 for 32-bit and 36 or 57 for u64/i64). So out-of-order exec can see farther around FP division, and if your instruction mix includes little enough FP division relative to mul/add/fma and doesn't bottleneck on it, each division basically costs about the same as a mul/add/fma. See Floating point division vs floating point multiplication.

AMD has had integer division at only 2 uops for a long time. (Not 1, since it has to write 2 output registers, rDX and rAX; the exception is 8-bit division, which writes AH:AL = AX, but that's also 2 uops on AMD, although only 1 on Gracemont E-cores.)
Upvotes: 5