Reputation: 3526
I'm writing a program which performs millions of modular additions. To make it more efficient, I started thinking about how machine-level instructions could be used to implement modular addition.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>

int main()
{
    unsigned int x, y, z, i;
    clock_t t1, t2;

    x = y = 0x90000000;

    t1 = clock();
    for (i = 0; i < 20000000; i++)
        z = (x + y) % 0x100000000ULL;
    t2 = clock();

    printf("%x\n", z);
    printf("%d\n", (int)(t2 - t1));
    return 0;
}
I compiled it using GCC with the following options (I used -O0 to prevent GCC from unrolling the loop):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
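In C terms, this is just unsigned wraparound. As a minimal sketch of what the compiler effectively kept (my own paraphrase with a hypothetical function name, not compiler output):

/* Addition modulo 2^32 on a machine with 32-bit unsigned int.
   The explicit "% 0x100000000ULL" is a no-op because unsigned int
   arithmetic already wraps modulo 2^32, so only the add remains. */
unsigned int add_mod_2_32(unsigned int x, unsigned int y)
{
    return x + y;   /* wraps modulo 2^32; the carry is discarded */
}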
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions, I expected the performance gain from a proper choice of modulus to be at least 100%. However, running the two executables, I found that the speed increased by only about 10%. I suspected the OS was using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to a single core didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second version is around 12 times slower than the first!
Upvotes: 2
Views: 567
Reputation: 183873
For the power-of-2 modulus, the code generated for the computation with -O0 and -O3 is identical; the difference is in the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
With unoptimised code, the running times for both moduli are similar: about the same as the running time without any modulus at all, and about the same as that of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop-control code:
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register (here) and decremented, and the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly now reduces the running time by a factor of 3 and 2, respectively.
So when compiled with -O0, you're not measuring the speed of the computation but the speed of the loop-control code, hence the small difference. With optimisations, you are measuring both the computation and the loop control, and the difference in the speed of the computation shows clearly.
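As an illustration of that point, here is one way the measurement could be set up so that the computation survives optimisation. This is a sketch of my own, not part of the original test: the volatile qualifiers and the accumulating sum are assumptions, and it presumes a GCC-style build with -O2 or -O3.

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile forces the inputs to be re-read every iteration, so the
       compiler cannot fold the whole loop into a single constant */
    volatile unsigned int x = 0x90000000u;
    volatile unsigned int y = 0x90000000u;
    unsigned int sum = 0;
    unsigned int i;
    clock_t t1, t2;

    t1 = clock();
    for (i = 0; i < 20000000u; i++)
        sum += (x + y) % 0xF0000000u;   /* the operation under test */
    t2 = clock();

    /* printing the accumulated result keeps the computation live */
    printf("%x\n", sum);
    printf("%ld\n", (long)(t2 - t1));
    return 0;
}

With the result kept live, the optimiser cannot discard the loop body, and the two moduli can be compared on equal footing at -O2/-O3.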
Upvotes: 2
Reputation: 11821
The difference between the two boils down to the fact that divisions by powers of 2 can easily be transformed into logic instructions: a / n, where n is a power of two, is equivalent to a >> log2(n), and likewise a mod n can be computed as a & (n - 1).
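A small self-contained illustration of that identity (my own example; mod_pow2 is a hypothetical helper name, and the mask trick is only valid when n is a power of two):

#include <assert.h>

/* a % n computed with a mask; valid only when n is a power of two */
static unsigned int mod_pow2(unsigned int a, unsigned int n)
{
    return a & (n - 1);
}

int main(void)
{
    assert(mod_pow2(0x12345678u, 0x1000u) == 0x12345678u % 0x1000u);
    return 0;
}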
But in your case it goes even further than that: your value 0x100000000ULL is 2^32, and any unsigned 32-bit variable is automatically reduced modulo 2^32. The compiler was smart enough to remove the operation because it is unnecessary on 32-bit variables; the ULL suffix doesn't change that.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. Instead it uses a transformation that is faster than a division instruction: since this modulus is larger than 2^31, any 32-bit sum is less than twice the modulus, so a compare followed by at most one subtraction suffices.
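For comparison, a C rendering of the compare-and-subtract idiom visible in the generated assembly; this is a sketch of my own (reduce_large_mod is a hypothetical name), not the compiler's literal output:

/* s % m for a 32-bit s when m > 0x80000000: since s < 2*m, at most one
   subtraction of m is needed.  GCC's cmp/setae/shift/sub sequence above
   does the same thing without a branch. */
static unsigned int reduce_large_mod(unsigned int s, unsigned int m)
{
    return (s >= m) ? s - m : s;
}

/* e.g. z = reduce_large_mod(x + y, 0xF0000000u);  the sum first wraps
   modulo 2^32, then is reduced modulo 0xF0000000 */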
Upvotes: 1