Guillaume Petitjean

Reputation: 2718

Cortex-M7 floating-point arithmetic instruction duration with a zero operand

I'd like to know whether a floating-point instruction like VMUL takes significantly fewer cycles on a Cortex-M7 FPU when one of its operands is zero.

The reason is that I'm profiling software that processes many variables coming from analog sources, and more precisely how these variables evolve over time. Right now the "front end" (i.e. the analog sources) is not available, so I'm using simulated variables. Since these do not change over time, many variables in the code are zero.

Upvotes: 0

Views: 959

Answers (2)

Guillaume Petitjean

Reputation: 2718

So I overcame my laziness and profiled it myself :)

Here is the function I used to execute 100 back-to-back double-precision vmul instructions on an STM32H753, built with GCC (FPv5-D16 FPU, -mfloat-abi=hard, -Ofast):

void __attribute__((noinline))
asmMulDsimple(double a, double b) {
  /* With -mfloat-abi=hard, a arrives in d0 and b in d1. */
  asm volatile( ".rept 100           \n"  /* the assembler emits the next line 100 times */
                "vmul.f64 d2, d0, d1 \n"
                ".endr               \n"
               : [a] "+&r"(a), [b] "+&r"(b)
               :
               : "cc", "memory", "r12");
}

And the calls in main (Reset_Cycle_Counter and Get_Cycle_Counter are simple helpers around the DWT_CYCCNT cycle counter):

    Reset_Cycle_Counter();
    {
        asmMulDsimple(1.00000001, 2.0000000004);

        printf("Duration with 100 vmul, complex operands: %lu cycles\r\n", Get_Cycle_Counter());
    }


    Reset_Cycle_Counter();
    {
        asmMulDsimple(1, 2);

        printf("Duration with 100 vmul, simple operands: %lu cycles\r\n", Get_Cycle_Counter());
    }

    Reset_Cycle_Counter();
    {
        asmMulDsimple(0, 2.0000000004);

        printf("Duration with 100 vmul, 0 operands: %lu cycles\r\n", Get_Cycle_Counter());
    }

And the output, with both the instruction cache and the data cache enabled:

Duration with 100 vmul, complex operands: 502 cycles
Duration with 100 vmul, simple operands: 499 cycles
Duration with 100 vmul, 0 operands: 406 cycles

As you can see, there is a significant difference when an operand is 0: about 20% fewer cycles (406 versus roughly 500).
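The two counter helpers are not shown above. A minimal sketch of what they might look like follows, assuming the standard Armv7-M debug register addresses for DEMCR, DWT_CTRL and DWT_CYCCNT. The pointer indirection is an assumption of this sketch (it lets the code be exercised off-target), not necessarily how the original helpers were written. Some Cortex-M7 parts reportedly also need the DWT lock access register written first; that step is left out here.

```c
#include <stdint.h>

/* Architecturally defined Armv7-M debug register addresses. */
#define DEMCR_ADDR       0xE000EDFCu   /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u   /* DWT control register */
#define DWT_CYCCNT_ADDR  0xE0001004u   /* DWT cycle counter */

#define DEMCR_TRCENA        (1u << 24) /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)  /* enables the cycle counter */

/* Assumption of this sketch: the registers are reached through pointers
 * so that they can be redirected to plain memory in an off-target test. */
static volatile uint32_t *demcr      = (volatile uint32_t *)(uintptr_t)DEMCR_ADDR;
static volatile uint32_t *dwt_ctrl   = (volatile uint32_t *)(uintptr_t)DWT_CTRL_ADDR;
static volatile uint32_t *dwt_cyccnt = (volatile uint32_t *)(uintptr_t)DWT_CYCCNT_ADDR;

/* Enable trace, zero the counter, then start counting. */
void Reset_Cycle_Counter(void)
{
    *demcr |= DEMCR_TRCENA;
    *dwt_cyccnt = 0u;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Return the number of core cycles since the last reset (wraps at 2^32). */
uint32_t Get_Cycle_Counter(void)
{
    return *dwt_cyccnt;
}
```

With helpers like these, each measurement also includes the call and return overhead of the function under test, so the comparison between runs is more meaningful than the absolute cycle counts.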

Upvotes: 1

Peter Cordes

Reputation: 364458

Pipelined CPUs usually have fixed latencies (not data-dependent) for everything except very slow operations like div. Otherwise you have to deal with write-back conflicts if you start a "fast" instruction a cycle or two after a "slow" instruction.

You could test it yourself by running the vmul in a latency-bound loop (e.g. multiply a register by itself 3 or 4 times in an unrolled loop). Try with "simple" values like 0.0, then with non-simple values like 1.0000000001 (which has many significant digits). Run enough loop iterations to hide measurement overhead, but few enough that you stop before overflow to +Inf.
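A latency-bound chain like the one described could be sketched in plain C as follows. The function name square_chain and the 4x unroll factor are made up for illustration, and the compiler is free to schedule the code differently from hand-written assembly, so check the disassembly for a chain of dependent vmul.f64 instructions before trusting any numbers.

```c
/* Hypothetical helper: squares x four times per loop iteration.
 * Every multiply consumes the result of the previous one, so the
 * chain is bound by multiply latency rather than throughput. */
double square_chain(double seed, unsigned iters)
{
    volatile double input = seed;  /* volatile read keeps the compiler from constant-folding */
    double x = input;

    for (unsigned i = 0; i < iters; ++i) {
        x *= x;   /* x^2  */
        x *= x;   /* x^4  */
        x *= x;   /* x^8  */
        x *= x;   /* x^16 */
    }
    return x;
}
```

Starting from 1.0000000001, ten iterations (40 squarings) still give a finite result, while eleven iterations overflow to +Inf, so the chain has to stay short. On the target, bracket square_chain(0.0, 10) and square_chain(1.0000000001, 10) with Reset_Cycle_Counter and Get_Cycle_Counter and compare the two cycle counts, repeating each measurement to average out the overhead.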

Upvotes: 3
