Reputation: 805
Why don't we see twice the performance when executing 64-bit operations (e.g. double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine?
In a 32-bit machine, don't we need to fetch twice as much from memory? More importantly, don't we need twice as many cycles to execute a 64-bit operation?
Upvotes: 2
Views: 3405
Reputation: 41753
In a 32-bit machine, don't we need to fetch twice as much from memory?
No. In most modern CPUs the memory bus is at least 64 bits wide, and newer microarchitectures may have wider buses still: quad-channel memory implies a 256-bit bus at minimum, and many contemporary CPUs even support 6- or 8-channel memory. So you need only one fetch to get a double. Besides, most of the time the value is already in cache, so loading it won't take much time: CPUs don't load a single value but a whole cache line each time.
More importantly, don't we need twice as many cycles to execute a 64-bit operation?
First you should know that double actually has 53 significant bits, which is more than twice float's 24, so the work isn't simply "twice as much". When the number of bits doubles, addition and subtraction become twice as hard, while multiplication becomes roughly four times as hard. Many other more complex operations need even more effort.
But despite that harder math work, non-memory operations on float and double usually take the same time on most modern architectures, because both are done in the same set of registers with the same ALU/FPU. Those powerful FPUs can add two doubles in a single cycle, so even if two floats could be added faster, the operation still costs one cycle. On the old Intel x87 the internal registers are 80 bits wide and both single- and double-precision values must be extended to 80 bits, hence their performance is also the same: there's no way to do math in types narrower than the 80-bit extended format.
With SIMD support like SSE2/AVX/AVX-512 you can process 2/4/8 doubles at a time (or even more in other SIMD ISAs), so adding two doubles is only a tiny task for a modern FPU. However, with SIMD a register fits twice as many floats as doubles, so float operations will be faster if you need to do a lot of math in parallel: in the cycle it takes to work on 4 doubles, you can work on 8 floats.
Another case where float is faster than double is when you work on a huge array, because more floats than doubles fit in a cache line. As a result, using double incurs more cache misses when you traverse the array.
Upvotes: 3
Reputation: 80265
“64-bit machine” is an ambiguous term but usually means that the processor's General-Purpose Registers are 64 bits wide. Compare the 8086 and 8088, which have the same instruction set and can both be called 16-bit processors in this sense, even though the 8088's external data bus is only 8 bits wide.
When the phrase is used in this sense, it says nothing about the width of the memory bus, the width of the internal buses inside the CPU, or the ability of the ALU to operate efficiently on 32- or 64-bit-wide data.
Your question also assumes that the hardest part of a multiplication is moving the operands to the unit that takes care of multiplication inside the processor, which wouldn't be quite true even if the operands came from memory and the bus were 32 bits wide, because latency != throughput. Also, regarding the mathematics of floating-point multiplication, a 64-bit multiplication is not twice as hard as a 32-bit one: it is roughly (53/24)² ≈ 4.9 times as hard (but, again, the transistors can be there to compute the double-precision multiplication efficiently regardless of the width of the General-Purpose Registers).
Upvotes: 8