Timing speed of math operations on Arduino - anomaly

Question

I was doing some research into integer division on the Arduino Uno, and trying to determine the fastest methods of solving division.

In readings, I came across this post.

I grabbed and modified the sketch that Nick Gammon posted in reply #4, the ultimate aim being to compile a list of the computations that I need to perform, with their timings.

Here is my modified sketch:

#define DATATYPE byte
#define OPERATION i / j

volatile DATATYPE x=0;
volatile DATATYPE y=0;
//volatile DATATYPE j=7;
unsigned long time;
unsigned long loop_time;

void setup ()
{
    Serial.begin(115200);
    Serial.println ();

    time = micros();
    for (DATATYPE i = 1; i < 101; i++) {
        //for (DATATYPE j = 1; j < 101; j++) {
        for (DATATYPE j = 56; j < 57; j++) {
            y++;
        } // end of for j
    }  // end of for i
    loop_time = micros() - time;
    Serial.print("Loop baseline Completion time: ");
    Serial.print(loop_time);
    Serial.print(" ");

    time = micros();
    for (DATATYPE i = 1; i < 101; i++) {
        //for (DATATYPE j = 1; j < 101; j++) {
        for (DATATYPE j = 56; j < 57; j++) {
          y++;
            x = OPERATION;
        } // end of for j
    }  // end of for i
    time = micros() - time;
    Serial.print("Completion time: ");
    Serial.print(time);
    Serial.print(" Operation time: ");
    time = time - loop_time;
    Serial.println(time);
} // end of setup

void loop() {}

The output of this is:

Loop baseline Completion time: 52 Completion time: 644 Operation time: 592

This essentially gives a time of 5usec for each division operation using bytes.

However, changing two loops from :

for (DATATYPE j = 56; j < 57; j++) {

to

for (DATATYPE j = 57; j < 58; j++) {

gives the following output:

Loop baseline Completion time: 52 Completion time: 108 Operation time: 56

I have tried a few different vales for the inner loop (7, 56, 58, 3) etc

3 gives a similar result, but 7, 11 and 13 do not

Can anyone tell me why there would be such a big difference in the times taken to perform the operation?

rici · Accepted Answer

If the compiler knows what the denominator in a division is, it may be able to compute the division with a multiply (by the inverse of the denominator) and a shift. Since division is really slow, even on CPUs which have divide instructions, this technique, often called reciprocal multiplication, can be a huge win.

Unfortunately, some denominators are easier to optimize than others, particularly when multiplication is restricted to 16 bits. So you're likely to see different behaviour with different constant divisors.

I gather that the compiler is not seeing the better optimisation available for that loop, which is to set a temporary to 0 and use it as the result of the division. When i reaches 56 (or 57 as the case may be), the temporary is incremented. I think this would radically reduce execution time.

Compilation test

Curiosity got the best of me so I installed the current Ubuntu avr-gcc package, which is v5.4.0, and tried some compiles. My program is much simpler than the original:

uint8_t dodiv(uint8_t d) {
  return d / DIVISOR;
}

Then I compiled it with commands like:

avr-gcc -S -Wall -O3 -o - -mmcu=atmega32 -DDIVISOR=57 avrdiv.c

As far as I know, -mmcu=atmega32 is correct for an Arduino Uno.

Here are a few results, restricted to the body of the dodiv function. I don't think it really explains the differential results in OP's test between 56 and 57, but I might be missing something. It certainly shows why 3 and 7 might have very different timings.

/* DIVISOR 3 */     /* DIVISOR 7 */     /* DIVISOR 56 */    /* DIVISOR 57 */
dodiv:              dodiv:              dodiv:              dodiv:
    ldi r25,lo8(-85)    ldi r25,lo8(37)     lsr r24             ldi r25,lo8(9)
    mul r24,r25         mul r24,r25         lsr r24             mul r24,r25
    mov r24,r1          mov r25,r1          lsr r24             mov r24,r1
    clr __zero_reg__    clr __zero_reg__    ldi r25,lo8(37)     clr __zero_reg__
    lsr r24             sub r24,r25         mul r24,r25         lsr r24
    ret                 lsr r24             mov r24,r1          ret
                        add r24,r25         clr __zero_reg__
                        lsr r24             ret                                              
                        lsr r24                                         
                        ret

Timing speed of math operations on Arduino - anomaly

Answers (1)

Compilation test

Related Questions