Sheng
Sheng

Reputation: 13

Unrolling nested for loop in C

I am new to unrolling loop. I am trying the optimize it by unrolling but I am not sure how to approach this. It would be a great help if someone could show me a simple way unroll it. Thanks!

 for(k=0;k<SIZE;k++){
     X_Real[k] = X_Imag[k] = 0.0;

           for(n=0;n<SIZE;n++){
               ph1 = ((twoPI*k*n)/SIZE);
               X_Real[k]=X_Real[k]+Signal1[n]*cos(ph1);
               X_Imag[k]=X_Imag[k]+Signal1[n]*sin(ph1);
           }

           X[k] = sqrt((X_Real[k]*X_Real[k])+(X_Imag[k]*X_Imag[k]));
       }

The SIZE would be flexible depending on the data point I use which is either 1000 or 1 million.

Upvotes: 0

Views: 351

Answers (2)

I am new to unrolling loop. I am trying the optimize it by unrolling

Leave such optimizations to your C compiler.

If your compiler is GCC:

  • read carefully the documentation about invoking GCC
  • try several compiler options related to optimizations, and benchmark your program. If on Linux, read time(7)
  • compile your C code (in file foo.c) with gcc -Wall -Wextra -O3 -fverbose-asm -S foo.c and look into the emitted foo.s assembly file
  • if these optimizations are not enough, write your GCC plugin to optimize even more. You'll need to understand GIMPLE (a central internal representation in GCC)

If you can upgrade or change your hardware, or can redesign your software, consider using OpenCL to run some code on your GPGPU


If your compiler is Clang, study its source code and improve it.


Both GCC and Clang are open source compilers for C, you are allowed to improve them.

In all cases, profile your code before spending efforts on manual optimizations

See also the DECODER and CompCert projects, and the Bismon static source code analyzer (they could be adapted to do such optimizations)

Consider also generating C or C++ code, like the RefPerSys project do (or on Linux like manydl.c does also with dlopen(3) and dlsym(3) on the generated plugin), and like explained in this blog. Or generate machine code at runtime with asmjit or libgccjit. You could use partial evaluation methods and memoization techniques.

For large data, consider transforming your loop into multi-threaded code (on Linux see pthreads(7)...), or OpenACC.

Manual optimizations like the one you consider may change the precision of the result (sometimes, the computed numbers becomes meaningless). See the Fluctuat and CADNA projects, and the floating point guide website.

Read also ACM SIGPLAN conference papers dedicated to optimizations.

NB. It could happen that you'll spend months of efforts to gain less than a few percents of performance. Discuss with your manager/boss/client if it is worth that effort. Be also aware of theoretical limitations related to the Rice's theorem.

Upvotes: 1

sidcoder
sidcoder

Reputation: 460

Example of just the inner loop (n-loop) is shown below (regardless if useful):

   for(n=0;n<SIZE-4;n+=4){

       ph1 = ((twoPI*k*n)/SIZE);
       X_Real[k]=X_Real[k]+Signal1[n]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[n]*sin(ph1);
 
       ph1 = ((twoPI*k*(n+1))/SIZE);
       X_Real[k]=X_Real[k]+Signal1[(n+1)]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[(n+1)]*sin(ph1);
      
       ph1 = ((twoPI*k*(n+2))/SIZE);
       X_Real[k]=X_Real[k]+Signal1[(n+2)]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[(n+2)]*sin(ph1);
      
       ph1 = ((twoPI*k*(n+3))/SIZE);
       X_Real[k]=X_Real[k]+Signal1[(n+3)]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[(n+3)]*sin(ph1);
       
       stored=n+3;
   }

   for(n=stored+1;n<SIZE;n++){
       ph1 = ((twoPI*k*n)/SIZE);
       X_Real[k]=X_Real[k]+Signal1[n]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[n]*sin(ph1);
   }

COMMENT:

  • Code optimization is a huge topic. Compiler optimization (with all best flags) is said to usually be better than naive manual (human) code optimization.
  • Anyway, also not a reason, never to take "it" into account. ("it" refers to manual loop unrolling or also manual code optimization).

Just a further idea, something like below (can still be improved), and then (always) perform 4 multiplications by ONE intrinsic instruction, with using _mm256_mul_pd. (... and then you can unroll again to n+=16). Anyway, here, sin and cos will "drag down" performance too much. Maybe, you find a fast vectorized version (of sin and cos), or if the range of k and n (==SIZE?!) is always the same, you can precompute and provide it in an optimal order (to avoid cache misses). To come back to the topic, "Loop unrolling" can prepare for a vectorized version of the code. Anyway further aspects are out-of-scope (and sin & cos are the bottle-neck here).

  v=(twoPI*k)/(double)SIZE; 
  for(n=0;n<SIZE-4;n+=4){

       ph0 = v*n;
       ph1 = v*(n+1);
       ph2 = v*(n+2);
       ph3 = v*(n+3);

       s0a=Signal1[n]*cos(ph0);
       s0b=Signal1[n]*sin(ph0);

       s1a=Signal1[(n+1)]*cos(ph1);
       s1b=Signal1[(n+1)]*sin(ph1);

       s2a=Signal1[(n+2)]*cos(ph2);
       s2b=Signal1[(n+2)]*sin(ph2);

       s3a=Signal1[(n+3)]*cos(ph3);
       s3b=Signal1[(n+3)]*sin(ph3);

       X_Real[k]+=s0a+s1a+s2a+s3a;
       X_Imag[k]+=s0b+s1b+s2b+s3b;
       
       stored=n+3;
   }

   for(n=stored+1;n<SIZE;n++){
       ph1 = ((twoPI*k*n)/SIZE);
       X_Real[k]=X_Real[k]+Signal1[n]*cos(ph1);
       X_Imag[k]=X_Imag[k]+Signal1[n]*sin(ph1);
   }

Upvotes: 2

Related Questions