Reputation: 3526
I'm writing a program which performs millions of modular additions. To make it more efficient, I started thinking about how machine-level instructions could be used to implement modular addition.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>

int main()
{
    unsigned int x, y, z, i;
    clock_t t1, t2;

    x = y = 0x90000000;

    t1 = clock();
    for (i = 0; i < 20000000; i++)
        z = (x + y) % 0x100000000ULL;
    t2 = clock();

    printf("%x\n", z);
    printf("%d\n", (int)(t2 - t1));
    return 0;
}
I compiled it using GCC with the following options (I used -O0 to prevent GCC from unrolling the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
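In C terms, this is just unsigned wraparound. As a minimal sketch of what the compiler effectively kept (my own paraphrase with a hypothetical function name, not compiler output):

/* Addition modulo 2^32 on a machine with 32-bit unsigned int.
   The explicit "% 0x100000000ULL" is a no-op because unsigned int
   arithmetic already wraps modulo 2^32, so only the add remains. */
unsigned int add_mod_2_32(unsigned int x, unsigned int y)
{
    return x + y;   /* wraps modulo 2^32; the carry is discarded */
}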
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions, I expected the performance gain from a proper choice of modulus to be at least 100%. However, running the two executables, I found that the speed increased by only about 10%. I suspected the OS was using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to a single core didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second version is around 12 times slower than the first!
Upvotes: 2
Views: 567
Reputation: 183873
For the power-of-2 modulus, the code generated for the computation with -O0 and -O3 is identical; the difference is in the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
With unoptimised code, the running times for both moduli are similar: about the same as the running time without any modulus at all, and about the same as that of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop-control code:
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, and the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly now reduces the running time by a factor of 3 and 2, respectively.
So when compiled with -O0, you're not measuring the speed of the computation but the speed of the loop-control code, hence the small difference. With optimisations, you are measuring both the computation and the loop control, and the difference in the speed of the computation shows clearly.
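As an illustration of that point, here is one way the measurement could be set up so that the computation survives optimisation. This is a sketch of my own, not part of the original test: the volatile qualifiers and the accumulating sum are assumptions, and it presumes a GCC-style build with -O2 or -O3.

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile forces the inputs to be re-read every iteration, so the
       compiler cannot fold the whole loop into a single constant */
    volatile unsigned int x = 0x90000000u;
    volatile unsigned int y = 0x90000000u;
    unsigned int sum = 0;
    unsigned int i;
    clock_t t1, t2;

    t1 = clock();
    for (i = 0; i < 20000000u; i++)
        sum += (x + y) % 0xF0000000u;   /* the operation under test */
    t2 = clock();

    /* printing the accumulated result keeps the computation live */
    printf("%x\n", sum);
    printf("%ld\n", (long)(t2 - t1));
    return 0;
}

With the result kept live, the optimiser cannot discard the loop body, and the two moduli can be compared on equal footing at -O2/-O3.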
Upvotes: 2
Reputation: 11821
The difference between the two boils down to the fact that divisions by powers of 2 can easily be transformed into logic instructions: a / n, where n is a power of two, is equivalent to a >> log2(n), and likewise a mod n can be computed as a & (n - 1).
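A small self-contained illustration of that identity (my own example; mod_pow2 is a hypothetical helper name, and the mask trick is only valid when n is a power of two):

#include <assert.h>

/* a % n computed with a mask; valid only when n is a power of two */
static unsigned int mod_pow2(unsigned int a, unsigned int n)
{
    return a & (n - 1);
}

int main(void)
{
    assert(mod_pow2(0x12345678u, 0x1000u) == 0x12345678u % 0x1000u);
    return 0;
}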
But in your case it goes even further than that: your value 0x100000000ULL is 2^32, and any unsigned 32-bit variable is automatically reduced modulo 2^32. The compiler was smart enough to remove the operation because it is unnecessary on 32-bit variables; the ULL suffix doesn't change that.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. Instead it uses a transformation that is faster than a division instruction: since this modulus is larger than 2^31, any 32-bit sum is less than twice the modulus, so a compare followed by at most one subtraction suffices.
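For comparison, a C rendering of the compare-and-subtract idiom visible in the generated assembly; this is a sketch of my own (reduce_large_mod is a hypothetical name), not the compiler's literal output:

/* s % m for a 32-bit s when m > 0x80000000: since s < 2*m, at most one
   subtraction of m is needed.  GCC's cmp/setae/shift/sub sequence above
   does the same thing without a branch. */
static unsigned int reduce_large_mod(unsigned int s, unsigned int m)
{
    return (s >= m) ? s - m : s;
}

/* e.g. z = reduce_large_mod(x + y, 0xF0000000u);  the sum first wraps
   modulo 2^32, then is reduced modulo 0xF0000000 */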
Upvotes: 1