Reputation: 7834
I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).
I'm only allowed to change a small block of code, and the starting point is such:
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Where ARRAY_SIZE
is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.
What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:
(These declarations are not looped over 200,000 times)
register int unroll_length = 16;
register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
register int *end = array + (ARRAY_SIZE -1);
register int *curr_end;
curr_end = end;
while (unroll_end != curr_end) {
sum += *curr_end;
curr_end--;
}
do {
sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
*(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
*(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
*(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
}
while ((curr_end -= unroll_length) != array);
sum += *curr_end;
Using these techniques, I was able to get the execution down to 5.5 seconds, which will give me full credit. However; I sure do want to earn the extra credit, but I'm also curious what additional optimizations I can make that I might be overlooking?
Edit #1 (Adding outer loop)
srand(time(NULL));
for(j = 0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
for (i = 0; i < N_TIMES; i++) {
// inner loop goes here
if (sum != checksum)
printf("Checksum error!\n");
sum = 0;
}
Upvotes: 5
Views: 1718
Reputation: 6809
I'm going to assume you're on x86, if you're not most of this will still apply but the details differ.
_aligned_malloc
or regular malloc
+ manual alignment. Besides that all you'll need in this case is _mm_add_epi32
to do four additions at the same time. (Different architectures have different SIMD units so check yours).All in all, I guess that 1 & 2 are easiest and most feasible and will gain you more than enough performance (eg. 8x on a Core 2 Duo). However, it all comes down to knowing your hardware and programming PIC will require completely different optimizations (eg. instruction level manual pipelining) than a general PC will.
Upvotes: 2
Reputation: 43688
If the array values don't change, you could memoize the sum (i.e. calculate it on first run, and use the calculated sum on subsequent runs).
Upvotes: 0
Reputation: 5640
Some nice optimization tricks:
So if you would use arrays, try to use:
register int idx = ARRAY_SIZE - 1;
register int sum = 0;
do {
sum += array[idx];
} while (idx-- % 10 != 0);
do {
sum += array[idx] + array[idx - 1] + array[idx - 2] + array[idx - 3] + array[idx - 4] + array[idx - 5] + array[idx - 6] + array[idx - 7] + array[idx - 8] + array[idx - 9];
} while (idx -= 10);
// now we don't use a comparison and the ZERO flag will be set in FLAG
// register on which we can conditional jump. With a comparison you do VALUE - VALUE
// and then check if the ZERO flag is set or the NEGATIVE flag or whatever you are testing on
Upvotes: -1
Reputation: 15114
Try to align the array on a page boundary ( i.e. 4K )
Try to compute with a wider data type, i.e. 64 bit- instead of 32-bit integers. This way you can add 2 numbers at once. As the final step add up the both halves.
Convert part of the array or the computation to floating point, so you can use FPU and CPU in parallel
I don't expect the following suggestions to be allowed but I mention them anyway
Upvotes: 1
Reputation: 80761
you could try to store your variables in CPU register with :
register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;
and try with different size of manual loops to check when you maximize cache usage.
Upvotes: 3