Reputation: 145
I was doing some research into integer division on the Arduino Uno, and trying to determine the fastest methods of solving division.
In readings, I came across this post.
I grabbed and modified the sketch that Nick Gammon posted in reply #4, the ultimate aim being to compile a list of the computations that I need to perform, with their timings.
Here is my modified sketch:
#define DATATYPE byte
#define OPERATION i / j
volatile DATATYPE x=0;
volatile DATATYPE y=0;
//volatile DATATYPE j=7;
unsigned long time;
unsigned long loop_time;
void setup ()
{
Serial.begin(115200);
Serial.println ();
time = micros();
for (DATATYPE i = 1; i < 101; i++) {
//for (DATATYPE j = 1; j < 101; j++) {
for (DATATYPE j = 56; j < 57; j++) {
y++;
} // end of for j
} // end of for i
loop_time = micros() - time;
Serial.print("Loop baseline Completion time: ");
Serial.print(loop_time);
Serial.print(" ");
time = micros();
for (DATATYPE i = 1; i < 101; i++) {
//for (DATATYPE j = 1; j < 101; j++) {
for (DATATYPE j = 56; j < 57; j++) {
y++;
x = OPERATION;
} // end of for j
} // end of for i
time = micros() - time;
Serial.print("Completion time: ");
Serial.print(time);
Serial.print(" Operation time: ");
time = time - loop_time;
Serial.println(time);
} // end of setup
void loop() {}
The output of this is:
Loop baseline Completion time: 52 Completion time: 644 Operation time: 592
This essentially gives a time of 5usec for each division operation using bytes.
However, changing two loops from :
for (DATATYPE j = 56; j < 57; j++) {
to
for (DATATYPE j = 57; j < 58; j++) {
gives the following output:
Loop baseline Completion time: 52 Completion time: 108 Operation time: 56
I have tried a few different vales for the inner loop (7, 56, 58, 3) etc
3 gives a similar result, but 7, 11 and 13 do not
Can anyone tell me why there would be such a big difference in the times taken to perform the operation?
Upvotes: 2
Views: 576
Reputation: 241901
If the compiler knows what the denominator in a division is, it may be able to compute the division with a multiply (by the inverse of the denominator) and a shift. Since division is really slow, even on CPUs which have divide instructions, this technique, often called reciprocal multiplication, can be a huge win.
Unfortunately, some denominators are easier to optimize than others, particularly when multiplication is restricted to 16 bits. So you're likely to see different behaviour with different constant divisors.
I gather that the compiler is not seeing the better optimisation available for that loop, which is to set a temporary to 0 and use it as the result of the division. When i
reaches 56 (or 57 as the case may be), the temporary is incremented. I think this would radically reduce execution time.
Curiosity got the best of me so I installed the current Ubuntu avr-gcc package, which is v5.4.0, and tried some compiles. My program is much simpler than the original:
uint8_t dodiv(uint8_t d) {
return d / DIVISOR;
}
Then I compiled it with commands like:
avr-gcc -S -Wall -O3 -o - -mmcu=atmega32 -DDIVISOR=57 avrdiv.c
As far as I know, -mmcu=atmega32
is correct for an Arduino Uno.
Here are a few results, restricted to the body of the dodiv
function. I don't think it really explains the differential results in OP's test between 56 and 57, but I might be missing something. It certainly shows why 3 and 7 might have very different timings.
/* DIVISOR 3 */ /* DIVISOR 7 */ /* DIVISOR 56 */ /* DIVISOR 57 */
dodiv: dodiv: dodiv: dodiv:
ldi r25,lo8(-85) ldi r25,lo8(37) lsr r24 ldi r25,lo8(9)
mul r24,r25 mul r24,r25 lsr r24 mul r24,r25
mov r24,r1 mov r25,r1 lsr r24 mov r24,r1
clr __zero_reg__ clr __zero_reg__ ldi r25,lo8(37) clr __zero_reg__
lsr r24 sub r24,r25 mul r24,r25 lsr r24
ret lsr r24 mov r24,r1 ret
add r24,r25 clr __zero_reg__
lsr r24 ret
lsr r24
ret
Upvotes: 2