PhD AP EcE

Reputation: 3991

How to reach AVX computation throughput for a simple loop

Recently I have been working on a numerical solver for computational electrodynamics using the finite-difference method.

The solver was simple to implement, but it is very difficult to reach the theoretical throughput of a modern processor with it, because there is only one math operation per loaded element, for example:

    #pragma ivdep
    for(int ii=0;ii<Large_Number;ii++)
    { Z[ii]  = C1*Z[ii] + C2*D[ii];}

Large_Number is about 1,000,000, and never bigger than 10,000,000.

I have tried manually unrolling the loop and writing AVX intrinsics, but failed to make it faster:

    int Vec_Size    = 8;                                // floats per __m256
    int Unroll_Num  = 6;                                // manual unroll factor
    int remainder   = Large_Number%(Vec_Size*Unroll_Num);
    int iter        = Large_Number/(Vec_Size*Unroll_Num);
    int addr_incr   = Vec_Size*Unroll_Num;

    __m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
    __m256 AVX_Z1,   AVX_Z2,   AVX_Z3,   AVX_Z4,   AVX_Z5,   AVX_Z6;

    __m256 AVX_Zb = _mm256_set1_ps(Zb);                 // broadcast the coefficients
    __m256 AVX_Za = _mm256_set1_ps(Za);
    for(int it=0;it<iter;it++)
    {
        int addr    = it*addr_incr;                     // base index of this unrolled block
        AVX_Div1    = _mm256_loadu_ps(&Div1[addr]);
        AVX_Z1      = _mm256_loadu_ps(&Z[addr]);
        AVX_Z1      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div1),_mm256_mul_ps(AVX_Za,AVX_Z1));
        _mm256_storeu_ps(&Z[addr],AVX_Z1);

        AVX_Div2    = _mm256_loadu_ps(&Div1[addr+8]);
        AVX_Z2      = _mm256_loadu_ps(&Z[addr+8]);
        AVX_Z2      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div2),_mm256_mul_ps(AVX_Za,AVX_Z2));
        _mm256_storeu_ps(&Z[addr+8],AVX_Z2);

        AVX_Div3    = _mm256_loadu_ps(&Div1[addr+16]);
        AVX_Z3      = _mm256_loadu_ps(&Z[addr+16]);
        AVX_Z3      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div3),_mm256_mul_ps(AVX_Za,AVX_Z3));
        _mm256_storeu_ps(&Z[addr+16],AVX_Z3);

        AVX_Div4    = _mm256_loadu_ps(&Div1[addr+24]);
        AVX_Z4      = _mm256_loadu_ps(&Z[addr+24]);
        AVX_Z4      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div4),_mm256_mul_ps(AVX_Za,AVX_Z4));
        _mm256_storeu_ps(&Z[addr+24],AVX_Z4);

        AVX_Div5    = _mm256_loadu_ps(&Div1[addr+32]);
        AVX_Z5      = _mm256_loadu_ps(&Z[addr+32]);
        AVX_Z5      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div5),_mm256_mul_ps(AVX_Za,AVX_Z5));
        _mm256_storeu_ps(&Z[addr+32],AVX_Z5);

        AVX_Div6    = _mm256_loadu_ps(&Div1[addr+40]);
        AVX_Z6      = _mm256_loadu_ps(&Z[addr+40]);
        AVX_Z6      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div6),_mm256_mul_ps(AVX_Za,AVX_Z6));
        _mm256_storeu_ps(&Z[addr+40],AVX_Z6);
    }
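    // Note: the last 'remainder' (= Large_Number % 48) elements are not processed
    // by this loop and still need a scalar or masked cleanup pass.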

The above AVX loop is actually a bit slower than the code the Intel compiler generates.

The compiler-generated code reaches about 8 GFLOPS/s, roughly 25% of the single-thread theoretical throughput of a 3 GHz Ivy Bridge processor. I wonder if it is even possible to reach the theoretical throughput for a simple loop like this.

Thank you!

Upvotes: 1

Views: 1364

Answers (3)

zam

Reputation: 1684

Improving performance for code like yours is a well-explored and still popular area. Take a look at dot-product optimization (a perfect link was already provided by Z Boson) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills).

In general, the key topics to explore and consider applying are:

  • AVX2's advantage over AVX, thanks to FMA and the improved load/store port microarchitecture on Haswell (see the sketch at the end of this answer)
  • Prefetching, and "streaming stores" / "non-temporal stores" on some platforms
  • Threading parallelism to reach the maximum sustained bandwidth
  • Unrolling (already done by you; the Intel compiler can also do it with #pragma unroll(X)). It does not make a big difference for "streaming" codes.
  • Finally, deciding which set of hardware platforms you want to optimize your code for

The last bullet is particularly important, because for "streaming" and generally memory-bound codes it's important to know more about the target memory subsystems; for example, with existing and especially future high-end HPC servers (the 2nd-generation Xeon Phi, codenamed Knights Landing, is one example) you may have a very different "roofline model" balance between bandwidth and compute, and may even need different techniques than when optimizing for an average desktop machine.
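To illustrate the first bullet, here is a minimal sketch of the same kernel written with AVX2/FMA intrinsics (the function name and signature are illustrative only; it requires a Haswell-class CPU and must be compiled with e.g. -mavx2 -mfma):

    #include <immintrin.h>

    // Sketch: Z[i] = C1*Z[i] + C2*D[i] using one FMA per 8 elements.
    void axpby_fma(float *Z, const float *D, float C1, float C2, int n)
    {
        __m256 vc1 = _mm256_set1_ps(C1);        // broadcast the coefficients
        __m256 vc2 = _mm256_set1_ps(C2);
        int ii = 0;
        for (; ii + 8 <= n; ii += 8) {
            __m256 z = _mm256_loadu_ps(&Z[ii]);
            __m256 d = _mm256_loadu_ps(&D[ii]);
            // one fused multiply-add replaces the separate mul+add pair:
            // z = vc2*d + (vc1*z)
            z = _mm256_fmadd_ps(vc2, d, _mm256_mul_ps(vc1, z));
            _mm256_storeu_ps(&Z[ii], z);
        }
        for (; ii < n; ii++)                    // scalar remainder
            Z[ii] = C1*Z[ii] + C2*D[ii];
    }

The FMA reduces the arithmetic instruction count per element, although for this kernel the limit may still be memory bandwidth, as discussed in the other answers.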

Upvotes: 2

PhD AP EcE

Reputation: 3991

I think the real bottleneck is that there are 2 loads and 1 store for every 2 multiplications and 1 addition, so the calculation may be memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOPS/s), that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder whether this argument holds and whether there is indeed no solution to this problem.
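To make the arithmetic concrete, here is a back-of-the-envelope sketch of that bandwidth ceiling (the 24 GB/s peak bandwidth is an assumption, not a measurement):

    #include <stdio.h>

    int main(void)
    {
        double bytes_per_elem = 12.0;  // load Z (4 B) + load D (4 B) + store Z (4 B)
        double flops_per_elem = 3.0;   // 2 multiplications + 1 addition
        double bandwidth      = 24e9;  // assumed peak DRAM bandwidth, bytes/s

        double elems_per_sec  = bandwidth / bytes_per_elem;          // = 2e9 elements/s
        double gflops         = elems_per_sec * flops_per_elem / 1e9; // = 6 GFLOPS/s

        printf("bandwidth-bound ceiling: %.1f GFLOPS/s\n", gflops);
        return 0;
    }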

The reason I am answering my own question is the hope that someone can correct me before I give up on the optimization too easily. This simple loop is EXTREMELY important for many scientific solvers; it is the backbone of the finite-element and finite-difference methods. If one cannot even feed one processor core because the computation is memory-bandwidth limited, why bother with multicore? A high-bandwidth GPU or a Xeon Phi would be a better solution.

Upvotes: 1

Hadi Brais

Reputation: 23719

Are you sure that 8 GFLOPS/s is about 25% of the maximum throughput of a 3 GHz Ivy Bridge processor? Let's do the calculation.

Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivy Bridge core can execute only one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Since the addition depends on the two multiplications, 3 cycles are needed to process a group of 8 elements; but because the addition can be overlapped with the multiplications of the next group, the steady-state cost is 2 cycles per 8 elements, i.e. 4 elements per cycle. For one billion elements, 2*10^9/8 = 2.5*10^8 cycles are required, which at a 3 GHz clock takes 2.5*10^8 / (3*10^9) = 1/12 ≈ 0.083 seconds. So the loop is bounded at 12*10^9 element updates per second; counting one update as one operation, the maximum theoretical throughput is 12 GFLOPS/s, and the compiler-generated code is reaching about 66% of it, which is fine.

One more thing: unrolling the loop 8 times allows it to be vectorized efficiently. I doubt that you'll gain any significant speedup by unrolling this particular loop more than that, and especially not more than 16 times.

Upvotes: 1
