Reputation:
I am doing tests with gcc-4.9
on a simple example to study vectorization (my little code computes the sum of
two double arrays and stores the result into an output array).
From what I have seen on the web, there seem to exist three packing sizes:
2*sizeof(double)
4*sizeof(double)
8*sizeof(double)
My issue is that in the three cases above, I always get a gain (between the non-vectorized and vectorized versions) roughly equal to 2 (a mean gain of about 1.7).
I think I am not using the right compilation options. Here is what I did:
For SSE : gcc-mp-4.9 -std=c99 -Wa,-q -O3 -march=native -ftree-vectorize -fopt-info-vec main.c
For AVX : gcc-mp-4.9 -std=c99 -Wa,-q -O3 -march=corei7-avx -ftree-vectorize -fopt-info-vec main.c
For AVX2 : gcc-mp-4.9 -std=c99 -Wa,-q -O3 -march=core-avx2 -ftree-vectorize -fopt-info-vec main.c
When I run these 3 cases, I always get a factor around 2, whereas I expect to reach a factor of 4 for AVX and a factor of 8 for AVX2.
The processor on my MacBook Pro is: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
Could anyone tell me the different flags to activate AVX and AVX2 vectorization?
Maybe my Core i7 doesn't support these vectorizations (just SSE?).
Thanks for your help.
Upvotes: 1
Views: 51
Reputation: 569
Assuming you've implemented the necessary unrolling and the right packing calls, the issue here is likely memory related:
1) You'll be hammering the cache a little harder as a result of the larger blocks of memory you need to exploit the more generous packing.
2) You may need to help the compiler by telling it that you want your data to be 32-byte aligned (this will help with optimisation). Look up "#pragma vector aligned"; it may or may not help.
3) There may also be an overhead if your array size is not a multiple of the packing width - for AVX2 this would be a multiple of 8. Some time may be spent in the "remainder" loop (but this should be a relatively small overhead).
4) Try reducing the degree of optimisation to -O2. Sometimes the more you tell the compiler to take charge, the less efficient your code can become.
But again you'll probably hit a "cache-efficiency" issue with larger packing operations (you'll likely be moving from L1 to L2).
Upvotes: 0