Reputation: 3079
For a homework assignment I need to optimize a loop to run in under 7.5 seconds. I think I may have done this, because my code runs in about 4 seconds. However, my instructor told us that anything too far under 7.5 seconds is probably wrong, so I am worried that I am not doing this correctly. Here is the original code:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }
    return 0;
}
Here is my optimization:
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE/2; j += 20) {
        sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;
Are these doing the same number of arithmetic operations? Did I change what the code computes, or am I just optimizing it well?
Upvotes: 0
Views: 712
Reputation: 142625
The code could be optimized in two ways: one is to improve the algorithm, the other is to improve it at the instruction level, i.e. make every operation run as fast as possible. Looking at your code, it seems you're going for the second one, and you're doing it quite rightly. One feature found in modern processors is "instruction pipelining", which has a few stages. The order of code execution is:
IF Instruction Fetch
ID Instruction Decode
EX Execution
Mem Memory access
WB Write Back
These stages can be done in parallel, i.e. while you're doing ID for one instruction, you can already do IF for the next one in advance. In the original version,

sum += array[j];

each iteration's IF has to wait for the previous operation to finish executing completely, which results in stalled CPU cycles. IF, ID, EX, Mem and WB each take 1 CPU cycle, so 5 CPU cycles are needed to complete the full instruction. But with loop unrolling,
sum += array[j]; // first op
sum += array[j+1]; // second op
sum += array[j+2];
sum += array[j+3];
sum += array[j+4]; // fifth op
in this implementation, while the first operation's ID is in progress, the IF for the second can happen in the same cycle, i.e. simultaneously. On the second CPU cycle you're doing ID of the first operation and IF of the second; on the third cycle you have IF of the third op, ID of the second and EX of the first. This exploits instruction-level parallelism and reduces the number of stalled CPU cycles.
Based on this technique, a typical way of optimizing a loop is to "unroll" it, i.e. loop unrolling; you can get a full schematic view and details of loop unrolling and instruction pipelining in this link.
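For reference, here is a minimal sketch of how the unrolled inner loop could look with two independent accumulators (this assumes ARRAY_SIZE is a multiple of 20 and that sum1 starts at zero; note the bound covers the whole array):
double sum = 0, sum1 = 0;              /* two independent accumulators */
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE; j += 20) {      /* 20 elements per iteration */
        sum  += array[j]    + array[j+1]  + array[j+2]  + array[j+3]  + array[j+4]
              + array[j+5]  + array[j+6]  + array[j+7]  + array[j+8]  + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14]
              + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;                           /* combine the partial sums once at the end */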
To back up what I tried to explain, let's run a test. I compiled your code and created two executables with the two different loops, then used perf to get an idea of how things went. Here are the results:
Performance counter stats for './test':
17739.862565 task-clock # 1.000 CPUs utilized
183 context-switches # 0.010 K/sec
5 cpu-migrations # 0.000 K/sec
138 page-faults # 0.008 K/sec
===> 58,408,599,809 cycles # 3.293 GHz
===> 34,387,134,201 stalled-cycles-frontend # 58.87% frontend cycles idle
===> 4,229,714,038 stalled-cycles-backend # 7.24% backend cycles idle
72,056,092,464 instructions # 1.23 insns per cycle
# 0.48 stalled cycles per insn
6,011,271,479 branches # 338.857 M/sec
618,206 branch-misses # 0.01% of all branches
17.744254427 seconds time elapsed
and now with unroll-loop-test:
Performance counter stats for './unroll-loop-test':
2395.115499 task-clock # 1.000 CPUs utilized
22 context-switches # 0.009 K/sec
2 cpu-migrations # 0.001 K/sec
138 page-faults # 0.058 K/sec
====> 7,885,935,372 cycles # 3.293 GHz
====> 1,569,263,256 stalled-cycles-frontend # 19.90% frontend cycles idle
====> 50,629,264 stalled-cycles-backend # 0.64% backend cycles idle
24,911,629,893 instructions # 3.16 insns per cycle
# 0.06 stalled cycles per insn
153,158,495 branches # 63.946 M/sec
607,999 branch-misses # 0.40% of all branches
2.395806562 seconds time elapsed
Take a close look at the number of cycles executed: with the unrolled loop the stalled cycles are far fewer, so fewer CPU cycles are needed; without unrolling, the stalled cycles consume more CPU cycles and performance is poor. So yes, you're doing a nice optimization, and both versions execute the same number of arithmetic operations. But also remember that if you run this program on a multiprocessor system, another level of optimization would be to split the whole work into parts and assign each part to a CPU available on the system; that is known as "parallel programming". I hope my answer clarifies the concept.
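To illustrate that last point only (not something your assignment asks for), here is a rough sketch using OpenMP, assuming your compiler supports it and you build with -fopenmp:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    /* split the outer iterations across the available CPUs;
       the reduction clause combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }

    printf("%f\n", sum);
    free(array);
    return 0;
}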
Upvotes: 1
Reputation: 667
Calloc sets all elements in the array to zero (actually all bits are set to zero). So you are really adding zero a bunch of times.
So let me run through some ways to potentially go faster, beyond what you are doing (which is good: you are avoiding comparisons, though if your array size weren't a multiple of 20 or whatever, you would have issues).
It may or may not be slightly faster to initialize your array statically and set the values to zero:
double array[ARRAY_SIZE] = {0};
Technically {} works with some compilers (it is standard only as of C23), but {0} is the portable and more explicit choice.
The inner for loop declares and reinitializes j to 0 on every pass of the outer loop. Declare int j outside of both loops and you potentially save N_TIMES of those operations.
For example, Carl Friedrich Gauss supposedly figured out as a child that if your sequence is 1, 2, 3, 4, ..., n (where n is the last number), then the sum is (n * (n + 1)) / 2. If n is 4, then 1 + 2 + 3 + 4 = 10, and (4 * 5) / 2 also equals ten.
There are other closed forms, like the sum of consecutive squares, i.e. (1^2 + 2^2 + 3^2 + 4^2 + ... + n^2). Read https://en.wikipedia.org/wiki/Square_pyramidal_number for more on that.
Anyway, my point is that understanding the math is important to optimization.
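As a tiny illustration (with made-up values, nothing to do with the assignment itself), the closed form replaces a whole loop with a couple of arithmetic operations:
/* summing 1..n the slow way and with Gauss's formula */
long long n = 10000;
long long loop_sum = 0;
long long k;
for (k = 1; k <= n; k++) {
    loop_sum += k;                        /* n additions */
}
long long closed_form = n * (n + 1) / 2;  /* constant number of operations */
/* loop_sum and closed_form are both 50005000 */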
You could potentially do something like:
int i, j;
double d = 0;

for (i = 0; i < ARRAY_SIZE; i++) {
    if (array[i] == 0)
        continue;               /* skip the common zero case */
    for (j = 0; j < N_TIMES; j++) {
        d += array[i];
    }
}
That's untested, because I doubt it would be acceptable, but the pattern (skipping to the next iteration of the loop in a common case to avoid subsequent unnecessary instructions) is a common optimization practice.
With an unoptimized compiler, using a pointer may be faster than an index. I.e., you could loop with:
double *arrayend = array + (ARRAY_SIZE - 1);
double *valuep;

for (valuep = array; valuep <= arrayend; valuep++) {
    // inner stuff
}
!= may be faster than <, though don't use equality comparisons on non-integer (floating-point) values.
Using unsigned numbers may be faster, though probably not in your case. See Signed vs Unsigned operations in C.
Integers are probably faster than doubles, but may not be big enough for the actual math.
Edit: one more. If you know the cache sizes of the system, you can potentially optimize for that.
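Putting the pointer and != suggestions together, the inner loop might look something like this (untested, and a modern optimizing compiler will likely produce the same code either way):
double *end = array + ARRAY_SIZE;     /* one past the last element */
double *p;

for (p = array; p != end; p++) {      /* pointer walk, != comparison */
    sum += *p;
}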
Upvotes: 0
Reputation: 44298
You can do:
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
int j;

for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;

return 0;
But this reduces the number of operations. To keep the same number of operations while still getting cache hits (and even register hits), you can do this:
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    int j;
    double d;

    for (j = 0; j < ARRAY_SIZE; j++) {
        d = array[j];
        for (i = 0; i < N_TIMES; i++) {
            sum += d;
        }
    }
    return 0;
}
Upvotes: 0
Reputation: 179422
Your optimizations are not correct:
for (j = 0; j < ARRAY_SIZE/2; j += 20) {
You now loop half as many times in the inner loop as you should.
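A sketch of the corrected loop header, keeping the unroll factor of 20 (ARRAY_SIZE here is a multiple of 20, so no remainder handling is needed):
for (j = 0; j < ARRAY_SIZE; j += 20) {   /* full array, not ARRAY_SIZE/2 */
    /* ... same unrolled body as before ... */
}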
Upvotes: 2