Reputation: 13
I am new to loop unrolling. I am trying to optimize this code by unrolling the loops, but I am not sure how to approach it. It would be a great help if someone could show me a simple way to unroll it. Thanks!
for(k = 0; k < SIZE; k++){
    X_Real[k] = X_Imag[k] = 0.0;
    for(n = 0; n < SIZE; n++){
        ph1 = (twoPI * k * n) / SIZE;
        X_Real[k] = X_Real[k] + Signal1[n] * cos(ph1);
        X_Imag[k] = X_Imag[k] + Signal1[n] * sin(ph1);
    }
    X[k] = sqrt((X_Real[k] * X_Real[k]) + (X_Imag[k] * X_Imag[k]));
}
SIZE is flexible, depending on the number of data points I use, which is either 1000 or 1 million.
Upvotes: 0
Views: 351
Reputation: 1
I am new to unrolling loop. I am trying the optimize it by unrolling
If your compiler is GCC: compile your code (e.g. foo.c) with gcc -Wall -Wextra -O3 -fverbose-asm -S foo.c and look into the emitted foo.s assembly file.
If you can upgrade or change your hardware, or can redesign your software, consider using OpenCL to run some code on your GPGPU.
If your compiler is Clang, study its source code and improve it.
Both GCC and Clang are open source compilers for C, you are allowed to improve them.
See also the DECODER and CompCert projects, and the Bismon static source code analyzer (they could be adapted to do such optimizations)
Consider also generating C or C++ code, as the RefPerSys project does (or, on Linux, as manydl.c does with dlopen(3) and dlsym(3) on the generated plugin), and as explained in this blog. Or generate machine code at runtime with asmjit or libgccjit. You could also use partial evaluation methods and memoization techniques.
For large data, consider transforming your loop into multi-threaded code (on Linux, see pthreads(7)...) or using OpenACC.
Manual optimizations like the one you consider may change the precision of the result (sometimes the computed numbers become meaningless). See the Fluctuat and CADNA projects, and the floating point guide website.
NB. It could happen that you spend months of effort to gain less than a few percent of performance. Discuss with your manager/boss/client whether it is worth that effort. Also be aware of the theoretical limitations related to Rice's theorem.
Upvotes: 1
Reputation: 460
An example of unrolling just the inner loop (the n-loop) by a factor of 4 is shown below (regardless of whether it turns out to be useful):
for(n = 0; n + 4 <= SIZE; n += 4){
    ph1 = (twoPI * k * n) / SIZE;
    X_Real[k] = X_Real[k] + Signal1[n] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n] * sin(ph1);
    ph1 = (twoPI * k * (n + 1)) / SIZE;
    X_Real[k] = X_Real[k] + Signal1[n + 1] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n + 1] * sin(ph1);
    ph1 = (twoPI * k * (n + 2)) / SIZE;
    X_Real[k] = X_Real[k] + Signal1[n + 2] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n + 2] * sin(ph1);
    ph1 = (twoPI * k * (n + 3)) / SIZE;
    X_Real[k] = X_Real[k] + Signal1[n + 3] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n + 3] * sin(ph1);
}
/* remainder loop: n already holds the first unprocessed index */
for(; n < SIZE; n++){
    ph1 = (twoPI * k * n) / SIZE;
    X_Real[k] = X_Real[k] + Signal1[n] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n] * sin(ph1);
}
COMMENT:
Just a further idea, something like below (which can still be improved): then you can (always) perform 4 multiplications with ONE intrinsic instruction, using _mm256_mul_pd (... and then you can unroll again to n += 16). Anyway, here sin and cos will "drag down" performance too much. Maybe you can find a fast vectorized version of sin and cos, or, if the range of k and n (== SIZE?!) is always the same, you can precompute them and provide them in an optimal order (to avoid cache misses). To come back to the topic: "loop unrolling" can prepare the code for vectorization. Further aspects are out of scope (and sin and cos are the bottleneck here).
v = (twoPI * k) / (double)SIZE;
for(n = 0; n + 4 <= SIZE; n += 4){
    ph0 = v * n;
    ph1 = v * (n + 1);
    ph2 = v * (n + 2);
    ph3 = v * (n + 3);
    s0a = Signal1[n] * cos(ph0);
    s0b = Signal1[n] * sin(ph0);
    s1a = Signal1[n + 1] * cos(ph1);
    s1b = Signal1[n + 1] * sin(ph1);
    s2a = Signal1[n + 2] * cos(ph2);
    s2b = Signal1[n + 2] * sin(ph2);
    s3a = Signal1[n + 3] * cos(ph3);
    s3b = Signal1[n + 3] * sin(ph3);
    X_Real[k] += s0a + s1a + s2a + s3a;
    X_Imag[k] += s0b + s1b + s2b + s3b;
}
/* remainder loop: n already holds the first unprocessed index */
for(; n < SIZE; n++){
    ph1 = v * n;
    X_Real[k] = X_Real[k] + Signal1[n] * cos(ph1);
    X_Imag[k] = X_Imag[k] + Signal1[n] * sin(ph1);
}
Upvotes: 2