drjrm3

Reputation: 4718

vectorized sum in Fortran

I am compiling my Fortran code using gfortran and -mavx and have verified via objdump that some instructions are vectorized, but I'm not getting the speed improvements I was expecting, so I want to make sure the following operation is being vectorized (this single line is ~50% of the runtime).

I know that some instructions can be vectorized, while others cannot, so I want to make sure this can be:

sum(A(i1:i2,ir))

Again, this single line takes about 50% of the runtime since I am doing this over a very large matrix. I can give more information on why I am doing this, but suffice it to say that it is necessary, though I could restructure the memory if needed (for example, I could do the sum as sum(A(ir,i1:i2)) if that could be vectorized instead).

Is this line being vectorized? How can I tell? How do I force vectorization if it is not being vectorized?

EDIT: Thanks to the comments, I now realize that I can check on the vectorization of this summation via -ftree-vectorizer-verbose and see that this is not vectorizing. I have restructured the code as follows:

tsum = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn
    tsum = tsum + tvec(ii)
enddo

and this ONLY vectorizes when I turn on -funsafe-math-optimizations, but I do see another 70% speed increase due to vectorization. The question still holds: Why does sum(A(i1:i2,ir)) not vectorize and how can I get a simple sum to vectorize?
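For reference, recent gfortran also exposes the vectorizer reports through the -fopt-info family of flags (flag names are from GCC's documentation; the file name here is just a placeholder):

```shell
# Ask gfortran which loops it vectorized, and which it did not (and why):
gfortran -O3 -mavx -fopt-info-vec-optimized -c mysum.f90   # report vectorized loops
gfortran -O3 -mavx -fopt-info-vec-missed -c mysum.f90      # report missed vectorization
```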

Upvotes: 7

Views: 1125

Answers (2)

Peter Cordes

Reputation: 363922

Your explicit loop version still does the FP adds in a different order than a vectorized version would. A vector version uses 4 accumulators, each one getting every 4th array element.

You could write your source code to match what a vector version would do:

tsum0 = 0.0d0
tsum1 = 0.0d0
tsum2 = 0.0d0
tsum3 = 0.0d0
tn = i2 - i1 + 1
tvec(1:tn) = A(i1:i2, ir)
do ii = 1,tn,4   ! count by 4
    tsum0 = tsum0 + tvec(ii)
    tsum1 = tsum1 + tvec(ii+1)
    tsum2 = tsum2 + tvec(ii+2)
    tsum3 = tsum3 + tvec(ii+3)
enddo

tsum = (tsum0 + tsum1) + (tsum2 + tsum3)

This might vectorize without -ffast-math.

FP add has multi-cycle latency, but one or two per clock throughput, so you need the asm to use multiple vector accumulators to saturate the FP add unit(s). Skylake can do two FP adds per clock, with latency=4. Previous Intel CPUs do one per clock, with latency=3. So on Skylake, you need two per clock × four cycles of latency = 8 vector accumulators to saturate the FP units. And of course they have to be 256b vectors, because AVX instructions are as fast as but do twice as much work as SSE vector instructions.

Writing the source with 8 vectors' worth of accumulator variables (32 scalar doubles) would be ridiculous, so I guess you need -ffast-math, or an OpenMP pragma that tells the compiler different orders of operations are ok.
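The OpenMP route could look like the sketch below (it assumes gfortran is invoked with -fopenmp or -fopenmp-simd, and reuses tsum/tvec/tn from the snippets above); the simd reduction clause explicitly tells the compiler the summation order may be rearranged:

```fortran
! Sketch: declare the reduction order relaxed so the compiler may
! vectorize the sum without -ffast-math (requires -fopenmp-simd).
tsum = 0.0d0
!$omp simd reduction(+:tsum)
do ii = 1, tn
    tsum = tsum + tvec(ii)
enddo
```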

Explicitly unrolling your source means you have to handle loop counts that aren't a multiple of the vector width * unroll. If you can guarantee things like alignment or the trip count, it can help the compiler avoid generating multiple versions of the loop or extra loop setup/cleanup code.
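As a sketch, the cleanup for a trip count that isn't a multiple of 4 could look like this, continuing the 4-accumulator loop above (assuming tsum0..tsum3 are already initialized to 0.0d0):

```fortran
! Main unrolled loop covers the largest multiple of 4 <= tn.
do ii = 1, tn - mod(tn, 4), 4
    tsum0 = tsum0 + tvec(ii)
    tsum1 = tsum1 + tvec(ii+1)
    tsum2 = tsum2 + tvec(ii+2)
    tsum3 = tsum3 + tvec(ii+3)
enddo
tsum = (tsum0 + tsum1) + (tsum2 + tsum3)
! Scalar cleanup loop for the 0..3 leftover elements.
do ii = tn - mod(tn, 4) + 1, tn
    tsum = tsum + tvec(ii)
enddo
```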

Upvotes: 0

drjrm3

Reputation: 4718

It turns out that I am not able to make use of the vectorization unless I include -ffast-math or -funsafe-math-optimizations.

The two code snippets I played with are:

tsum = 0.0d0
tvec(1:n) = A(i1:i2, ir)
do ii = 1,n
    tsum = tsum + tvec(ii)
enddo

and

tsum = sum(A(i1:i2,ir))

and here are the times I get when running the first code snippet with different compilation options:

10.62 sec ... None
10.35 sec ... -mtune=native -mavx
 7.44 sec ... -mtune=native -mavx -ffast-math
 7.49 sec ... -mtune=native -mavx -funsafe-math-optimizations

Finally, with these same optimizations, I am able to vectorize tsum = sum(A(i1:i2,ir)) to get

 7.96 sec ... None
 8.41 sec ... -mtune=native -mavx
 5.06 sec ... -mtune=native -mavx -ffast-math
 4.97 sec ... -mtune=native -mavx -funsafe-math-optimizations

Comparing the sum version compiled with -mtune=native -mavx (8.41 sec) against -mtune=native -mavx -funsafe-math-optimizations (4.97 sec) shows a ~70% speedup. (Note that these were each run only once; before we publish we will do true benchmarking over multiple runs.)

I do take a small accuracy hit, though: my values change slightly when I use the -f options. Without them, the errors for my variables (v1, v2) are:

v1 ... 5.60663e-15     9.71445e-17     1.05471e-15
v2 ... 5.11674e-14     1.79301e-14     2.58127e-15

but with the optimizations, the errors are:

v1 ... 7.11931e-15     5.39846e-15     3.33067e-16
v2 ... 1.97273e-13     6.98608e-14     2.17742e-14

which indicates that there truly is something different going on.

Upvotes: 1
