Reputation: 291
A critical part of my code has the following two loops. The first multiplies complex vector B (dimensions: N) with complex matrix out1 (dimensions: NxJ) and stores the result in inc (dimensions: NxJ). The second converts the complex matrix out2 (dimensions: NxJ) into amplitude and phase parts and stores them consecutively in t (dimensions: Nx2J). inc, out1, out2, and B are all of type fftw_complex (2 double), while t is an array of float.
for (int i = 0; i < N * J; i++)
{
    k = i % N;
    inc[i][REAL] = out1[k][REAL] * B[i][REAL] - out1[k][IMAG] * B[i][IMAG];
    inc[i][IMAG] = out1[k][REAL] * B[i][IMAG] + out1[k][IMAG] * B[i][REAL];
}
for (int i = 0; i < N * J; i++)
{
    t[i] = (float) sqrt(out2[i][REAL] * out2[i][REAL]
                        + out2[i][IMAG] * out2[i][IMAG]);
    t[N * J + i] = (float) atan2(out2[i][IMAG], out2[i][REAL]);
}
When compiled with -Ofast -ftree-vectorize -fopt-info-vec-missed -mavx2 -msse4, the output for loop 1 is:
note: not vectorized: not suitable for gather load _50 = *_49[0];
note: bad data references.
note: not vectorized: not enough data-refs in basic block.
note: not consecutive access _50 = *_49[0];
note: Build SLP failed: unrolling required in basic block SLP
note: not consecutive access _50 = *_49[0];
note: Build SLP failed: unvectorizable statement _50 = *_49[0];
note: Build SLP failed: different interleaving chains in one node _60 = *_49[0];
and the output for loop 2 is:
note: versioning for alias required: can't determine dependence between *_70 and *_84
note: vector alignment may not be reachable
note: virtual phi. skip.
note: num. args = 4 (not unary/binary/ternary op).
note: not ssa-name.
note: use not simple.
note: no array mode for V4DF[2]
note: num. args = 4 (not unary/binary/ternary op).
note: not ssa-name.
note: use not simple.
note: no array mode for V4DF[2]
note: function is not vectorizable.
note: not vectorized: relevant stmt not supported: _85 = atan2 (_75, _73);
note: bad operation or unsupported loop bound.
note: versioning for alias required: can't determine dependence between *_70 and *_84
note: vector alignment may not be reachable
note: virtual phi. skip.
note: num. args = 4 (not unary/binary/ternary op).
note: not ssa-name.
note: use not simple.
note: no array mode for V2DF[2]
note: num. args = 4 (not unary/binary/ternary op).
note: not ssa-name.
note: use not simple.
note: no array mode for V2DF[2]
note: function is not vectorizable.
note: not vectorized: relevant stmt not supported: _85 = atan2 (_75, _73);
note: bad operation or unsupported loop bound.
note: not vectorized: no grouped stores in basic block.
I have observed that these loops are the bottlenecks in my code. How do I vectorize them?
Upvotes: 2
Views: 1480
Reputation: 30817
My compilable version of the code is
#include <math.h>

typedef double complex[2];
static const int REAL = 0;
static const int IMAG = 1;

void loop1(int N, int J, const complex B[], const complex out1[], complex inc[])
{
    const int NJ = N * J;
    for (int i = 0; i < NJ; ++i) {
        const int k = i % N;
        inc[i][IMAG] = out1[k][REAL] * B[i][IMAG] + out1[k][IMAG] * B[i][REAL];
        inc[i][REAL] = out1[k][REAL] * B[i][REAL] - out1[k][IMAG] * B[i][IMAG];
    }
}

void loop2(int N, int J, float t[], const complex out2[])
{
    const int NJ = N * J;
    float *const p = t + NJ;
    for (int i = 0; i < NJ; ++i) {
        /*t[i] = (float) hypot(out2[i][REAL], out2[i][IMAG]);*/
        t[i] = (float) sqrt(out2[i][REAL] * out2[i][REAL] + out2[i][IMAG] * out2[i][IMAG]);
        p[i] = (float) atan2(out2[i][IMAG], out2[i][REAL]);
    }
}
For the first loop, I get:
42504487.c:10:5: note: not vectorized: not suitable for gather load _16 = *_15[0];
42504487.c:10:5: note: bad data references.
42504487.c:10:5: note: not vectorized: not enough data-refs in basic block.
42504487.c:15:1: note: not vectorized: not enough data-refs in basic block.
42504487.c:10:5: note: Two or more load stmts share the same dr.
42504487.c:10:5: note: Two or more load stmts share the same dr.
42504487.c:10:5: note: Build SLP failed: unrolling required in basic block SLP
42504487.c:10:5: note: Two or more load stmts share the same dr.
42504487.c:10:5: note: Two or more load stmts share the same dr.
42504487.c:10:5: note: can't determine dependence between *_11[1] and *_15[1]
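The "not suitable for gather load" note points at the out1[k] access: the index i % N is not an affine function of i, which appears to be what defeats the data-reference analysis. Below is a minimal sketch of one possible rewrite, assuming (as the indexing implies) that out1 holds N elements while B and inc hold N*J. It walks the J blocks in an outer loop so that no modulo is needed, and marks the pointers restrict in response to the "can't determine dependence" note. The names loop1_nested and complex_t are illustrative, and whether GCC then vectorises the inner loop depends on the compiler version and flags.

```c
#include <stddef.h>

typedef double complex_t[2];   /* illustrative stand-in for fftw_complex */
enum { REAL = 0, IMAG = 1 };

/* Sketch only: assumes inc does not overlap B or out1 (hence restrict),
 * out1 has N elements, and B and inc have N*J elements. */
void loop1_nested(int N, int J,
                  const complex_t *restrict B,
                  const complex_t *restrict out1,
                  complex_t *restrict inc)
{
    for (int j = 0; j < J; ++j) {
        const size_t base = (size_t)j * (size_t)N;   /* start of block j */
        for (int k = 0; k < N; ++k) {
            const size_t i = base + (size_t)k;       /* same i as the flat loop */
            inc[i][REAL] = out1[k][REAL] * B[i][REAL] - out1[k][IMAG] * B[i][IMAG];
            inc[i][IMAG] = out1[k][REAL] * B[i][IMAG] + out1[k][IMAG] * B[i][REAL];
        }
    }
}
```

The results are the same as in the flat loop, since i = j*N + k gives i % N == k; only the way the index is formed changes.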
For the second loop, I get:
42504487.c:21:5: note: versioning for alias required: can't determine dependence between *_13 and *_25
42504487.c:21:5: note: vector alignment may not be reachable
42504487.c:21:5: note: virtual phi. skip.
42504487.c:21:5: note: num. args = 4 (not unary/binary/ternary op).
42504487.c:21:5: note: not ssa-name.
42504487.c:21:5: note: use not simple.
42504487.c:21:5: note: no array mode for V4DF[2]
42504487.c:21:5: note: num. args = 4 (not unary/binary/ternary op).
42504487.c:21:5: note: not ssa-name.
42504487.c:21:5: note: use not simple.
42504487.c:21:5: note: no array mode for V4DF[2]
42504487.c:21:5: note: function is not vectorizable.
42504487.c:21:5: note: not vectorized: relevant stmt not supported: _26 = atan2 (_19, _17);
42504487.c:21:5: note: bad operation or unsupported loop bound.
42504487.c:21:5: note: versioning for alias required: can't determine dependence between *_13 and *_25
42504487.c:21:5: note: vector alignment may not be reachable
42504487.c:21:5: note: virtual phi. skip.
42504487.c:21:5: note: num. args = 4 (not unary/binary/ternary op).
42504487.c:21:5: note: not ssa-name.
42504487.c:21:5: note: use not simple.
42504487.c:21:5: note: no array mode for V2DF[2]
42504487.c:21:5: note: num. args = 4 (not unary/binary/ternary op).
42504487.c:21:5: note: not ssa-name.
42504487.c:21:5: note: use not simple.
42504487.c:21:5: note: no array mode for V2DF[2]
42504487.c:21:5: note: function is not vectorizable.
42504487.c:21:5: note: not vectorized: relevant stmt not supported: _26 = atan2 (_19, _17);
42504487.c:21:5: note: bad operation or unsupported loop bound.
42504487.c:21:5: note: not vectorized: not enough data-refs in basic block.
42504487.c:26:1: note: not vectorized: not enough data-refs in basic block.
42504487.c:21:5: note: not vectorized: no grouped stores in basic block.
Here, we have "no array mode for V4DF[2]" and "no array mode for V2DF[2]", suggesting that we don't have suitable vector types for the vectorisation. Also, "relevant stmt not supported: atan2" tells us that there isn't a vector implementation of atan2.
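At this point, if there are enough cores available, I'd look to thread-level parallelism instead: OpenMP, or perhaps GCC's automatic loop parallelisation with -floop-parallelize-all (typically combined with -ftree-parallelize-loops=n).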
Upvotes: 1