JMP
JMP

Reputation: 7834

Manually optimize a nested loop

I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).

I'm only allowed to change a small block of code, and the starting point is such:

    for (j=0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }

Where ARRAY_SIZE is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.

What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:

(These declarations are not looped over 200,000 times)

 register int unroll_length = 16;
 register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
 register int *end = array + (ARRAY_SIZE -1);
 register int *curr_end;

curr_end = end;

while (unroll_end != curr_end) {
 sum += *curr_end;
 curr_end--;
}

do {
 sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
  *(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
  *(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
  *(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
}
while ((curr_end -=  unroll_length) != array);

sum += *curr_end;

Using these techniques, I was able to get the execution down to 5.5 seconds, which will give me full credit. However; I sure do want to earn the extra credit, but I'm also curious what additional optimizations I can make that I might be overlooking?

Edit #1 (Adding outer loop)

 srand(time(NULL));
 for(j = 0; j < ARRAY_SIZE; j++) {
  x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
  array[j] = x;
  checksum += x;
 }

 for (i = 0; i < N_TIMES; i++) {

  // inner loop goes here

  if (sum != checksum)
   printf("Checksum error!\n");

  sum = 0;

 }

Upvotes: 5

Views: 1718

Answers (5)

Jasper Bekkers
Jasper Bekkers

Reputation: 6809

I'm going to assume you're on x86, if you're not most of this will still apply but the details differ.

  1. Use SIMD/SSE, this will get you a 4x speed increase without much effort, it needs 16-byte aligned data that you can get with _aligned_malloc or regular malloc + manual alignment. Besides that all you'll need in this case is _mm_add_epi32 to do four additions at the same time. (Different architectures have different SIMD units so check yours).
  2. Use multi-threading/ multiple cores in this case it'd be easiest to have each thread sum half the array to a temporary variable and sum those two results when done. This will scale linearly across the number of cores available.
  3. Prefetch to L1 cache; this only works when you've got a huge array and are sure to be able to stress the CPU for at least ~200 cycles (eg. a roundtrip to main RAM).
  4. Completely go out of your way to optimize the hell out of it and use a GPU based approach. This will require you to set up a CUDA or OpenCL environment and upload the array to the GPU. This is about ~400 LoC excluding the compute kernel. But might not be worth it if you have a small dataset (eg. too much overhead in setting up/tearing down) or if you have a huge changing dataset (eg. too much time spend in streaming to the GPU).
  5. Align to page boundaries to prevent page-faults (expensive) on windows these are usually 4K in size.
  6. Manually unroll the loop while taking into account dual issuing instructions and instruction latencies. This information is available from your CPU manufacturer (Intel provides these too). But on x86 this isn't really useful because of it's CPUs out of order execution.
  7. Depending on your platform actually getting the data to the CPU for processing is the slowest part (this is mainly true for recent consoles & PS, I've never developed for small embedded devices) so you'll want to optimize for that. Tricks like iterating backwards are nice on a 6502 when cycles were the bottleneck but these days you'll want to access RAM linearly.
  8. If you do happen to be on a machine with fast RAM (eg. NOT PC/Consoles), converting from the plain array to a more fancy data-structure (eg. one that does more pointer chasing) might totally be worth it.

All in all, I guess that 1 & 2 are easiest and most feasible and will gain you more than enough performance (eg. 8x on a Core 2 Duo). However, it all comes down to knowing your hardware and programming PIC will require completely different optimizations (eg. instruction level manual pipelining) than a general PC will.

Upvotes: 2

ninjalj
ninjalj

Reputation: 43688

If the array values don't change, you could memoize the sum (i.e. calculate it on first run, and use the calculated sum on subsequent runs).

Upvotes: 0

DipSwitch
DipSwitch

Reputation: 5640

Some nice optimization tricks:

  • make your loop count backwards from ARRAY_SIZE to 0 so that way you can remove the comparisons from your code. Less comparisons speed up the program.
  • Furthermore x86 nowadays are optimized for short loops which they can "preload" to run faster then normal.
  • Try to use registers wherever possible
  • Use pointers instead of array indices

So if you would use arrays, try to use:

register int idx = ARRAY_SIZE - 1;
register int sum = 0;
do {
    sum += array[idx];
} while (idx-- % 10 != 0);

do {
    sum += array[idx] + array[idx - 1] + array[idx - 2] + array[idx - 3] + array[idx - 4] + array[idx - 5] + array[idx - 6] + array[idx - 7] + array[idx - 8] + array[idx - 9];
} while (idx -= 10);
// now we don't use a comparison and the ZERO flag will be set in FLAG
// register on which we can conditional jump. With a comparison you do VALUE - VALUE
// and then check if the ZERO flag is set or the NEGATIVE flag or whatever you are testing on

Upvotes: -1

Peter G.
Peter G.

Reputation: 15114

  • Try to align the array on a page boundary ( i.e. 4K )

  • Try to compute with a wider data type, i.e. 64 bit- instead of 32-bit integers. This way you can add 2 numbers at once. As the final step add up the both halves.

  • Convert part of the array or the computation to floating point, so you can use FPU and CPU in parallel

  • I don't expect the following suggestions to be allowed but I mention them anyway

    • Multithreading
    • Specialized CPU-Instructions, i.e. SSE

Upvotes: 1

C&#233;dric Julien
C&#233;dric Julien

Reputation: 80761

you could try to store your variables in CPU register with :

register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;

and try with different size of manual loops to check when you maximize cache usage.

Upvotes: 3

Related Questions